Abstract In this paper we address the problem of tracking
non-rigid objects whose local appearance and motion
change as a function of time. This class of objects
includes dynamic textures such as steam, fire, smoke, 
water, etc., as well as articulated objects such as humans 
performing various actions. We model the temporal evo- 
lution of the object's appearance/motion using a Linear 
Dynamical System (LDS). We learn such models from 
sample videos and use them as dynamic templates for 
tracking objects in novel videos. We pose the problem of 
tracking a dynamic non-rigid object in the current frame 
as a maximum a-posteriori estimate of the location of 
the object and the latent state of the dynamical system, 
given the current image features and the best estimate 
of the state in the previous frame. The advantage of 
our approach is that we can specify a-priori the type of 
texture to be tracked in the scene by using previously 
trained models for the dynamics of these textures. Our 
framework naturally generalizes common tracking meth- 
ods such as SSD and kernel-based tracking from static 
templates to dynamic templates. We test our algorithm 
on synthetic as well as real examples of dynamic textures
and show that our simple dynamics-based trackers perform
on par with, if not better than, the state-of-the-art.
Since our approach is general and applicable to any
image feature, we also apply it to the problem of human
action tracking and build action-specific optical flow
trackers that perform better than the state-of-the-art
when tracking a human performing a particular action.
Finally, since our approach is generative, we can use
a-priori trained trackers for different texture or action
classes to simultaneously track and recognize the texture
or action in the video.

Keywords Dynamic Templates • Dynamic Textures •
Human Actions • Tracking • Linear Dynamical Systems •
Recognition

R. Chaudhry
Center for Imaging Science
Johns Hopkins University
Baltimore, MD
Tel.: +1-410-516-4095
Fax: +1-410-516-4594
E-mail: rizwanch@cis.jhu.edu

G. Hager
Johns Hopkins University
Baltimore, MD
E-mail: hager@cs.jhu.edu

R. Vidal
Center for Imaging Science
Johns Hopkins University
Baltimore, MD
E-mail: rvidal@cis.jhu.edu



1 Introduction 

Object tracking is arguably one of the most important 
and actively researched areas in computer vision. Ac- 
curate object tracking is generally a pre-requisite for 
vision-based control, surveillance and object recognition
in videos. Some of the challenges to accurate object 
tracking are moving cameras, changing pose, scale and 
velocity of the object, occlusions, non-rigidity of the 
object shape and changes in appearance due to ambi- 
ent conditions. A very large number of techniques have 
been proposed over the last few decades, each trying to 
address one or more of these challenges under different 
assumptions. The comprehensive survey by Yilmaz et al. 
(2006) provides an analysis of over 200 publications in 
the general area of object tracking. 

In this paper, we focus on tracking objects that 
undergo non-rigid transformations in shape and appear- 
ance as they move around in a scene. Examples of such 
objects include fire, smoke, water, and fluttering flags, as
well as humans performing different actions. Collectively
called dynamic templates, these objects are fairly common
in natural videos. Due to their constantly evolving
appearance, they pose a challenge to state-of-the-art 
tracking techniques that assume consistency of appear- 
ance distributions or consistency of shape and contours. 
However, the change in appearance and motion profiles 
of dynamic templates is not entirely arbitrary and can 
be explicitly modeled using Linear Dynamical Systems 
(LDS). Standard tracking methods either use subspace 
models or simple Gaussian models to describe appear- 
ance changes of a mean template. Other methods use 
higher-level features such as skeletal structures or con- 
tour deformations for tracking to reduce dependence on 
appearance features. Yet others make use of foreground- 
background classifiers and learn discriminative features 
for the purpose of tracking. However, all these methods 
ignore the temporal dynamics of the appearance changes 
that are characteristic of the dynamic template.

Over the years, several methods have been developed 
for segmentation and recognition of dynamic templates, 
in particular dynamic textures. However, to the best of 
our knowledge the only work that explicitly addresses 
tracking of dynamic textures was done by Peteri (2010). 
As we will describe in detail later, this work also does 
not consider the temporal dynamics of the appearance 
changes and does not perform well in experiments. 

Paper Contributions and Outline. In the proposed 
approach, we model the temporal evolution of the ap- 
pearance of dynamic templates using Linear Dynamical 
Systems (LDSs) whose parameters are learned from 
sample videos. These LDSs will be incorporated in a 
kernel-based tracking framework that will allow us to
track non-rigid objects in novel video sequences. In the 
remaining part of this section, we will review some of 
the related works in tracking and motivate the need 
for a dynamic template tracking method. We will then
review static template tracking in §2. In §3, we pose the 
tracking problem as the maximum a-posteriori estimate 
of the location of the template as well as the internal 
state of the LDS, given a kernel-weighted histogram
observed at a test location in the image and the internal 
state of the LDS at the previous frame. This results 
in a novel joint optimization approach that allows us 
to simultaneously compute the best location as well as 
the internal state of the moving dynamic texture at the 
current time instant in the video. We then show how our 
proposed approach can be used to perform simultaneous 
tracking and recognition in §4. In §5, we first evaluate 
the convergence properties of our algorithm on synthetic 
data before validating it with experimental results on 
real datasets of Dynamic Textures and Human Activ- 
ities in §6, §7 and §8. Finally, we will mention future 
research directions and conclude in §9. 



Prior Work on Tracking Non-Rigid and Articulated
Objects. In the general area of tracking, Isard
and Blake (1998) and North et al. (2000) handcraft
models for object contours using splines and learn their
dynamics using Expectation Maximization (EM). They
then use particle filtering and Markov Chain Monte-Carlo
methods to track and classify the object motion.
However, in most cases, the object contours do
not vary significantly during the tracking task. In the
case of dynamic textures, there is generally no well-defined
contour and hence this approach is not directly
applicable. Black and Jepson (1998) propose using a
robust appearance subspace model for a known object
to track it later in a novel video. However, there are
no dynamics associated with the appearance changes and,
in each frame, the projection coefficients are computed
independently of previous frames. Jepson et al. (2001)
propose an EM-based method to estimate parameters
of a mixture model that combines stable object appearance,
frame-to-frame variations, and an outlier model
for robustly tracking objects that undergo appearance
changes. Although the motivation behind such a model
is compelling, its actual application requires heuristics
and a large number of parameters. Moreover, dynamic
textures do not have a stable object appearance model;
instead, the appearance changes according to a distinct
Gauss-Markov process characteristic of the class of the
dynamic texture.

Tracking of non-rigid objects is often motivated by
the application of human tracking in videos. In Pavlovic
et al. (1999), a Dynamic Bayesian Network is used to
learn the dynamics of human motion in a scene. Joint
angle dynamics are modeled using switched linear dynamical
systems and used for classification, tracking
and synthesis of human motion. Although the tracking
results for human skeletons are impressive, extreme
care is required to learn the joint dynamic models from
manually extracted skeletons or motion capture data.
Moreover, a separate dynamical system is learnt for each
joint angle instead of a global model for the entire object.
Approaches such as Leibe et al. (2008) maintain
multiple hypotheses for object tracks and continuously
refine them as the video progresses using a Minimum
Description Length (MDL) framework. The work by Lim
et al. (2006) models dynamic appearance by using non-linear
dimensionality reduction techniques and learns
the temporal dynamics of these low-dimensional
representations to predict future motion trajectories. Nejhum
et al. (2008) propose an online approach that deals with
appearance changes due to articulation by updating the
foreground shape using rectangular blocks that adapt to
find the best match in every frame. However, the foreground
appearance is assumed to be stationary throughout the
video.

Recently, classification based approaches have been 
proposed in Grabner et al. (2006) and Babenko et al. 
(2009) where classifiers such as boosting or Multiple 
Instance Learning are used to adapt foreground vs back- 
ground appearance models with time. This makes the 
tracker invariant to gradual appearance changes due 
to object rotation, illumination changes etc. This dis- 
criminative approach, however, does not incorporate 
an inherent temporal model of appearance variations, 
which is characteristic of, and potentially very useful
for, dynamic textures.

In summary, all the above works lack a unified frame- 
work that simultaneously models the temporal dynamics 
of the object appearance and shape as well as the mo- 
tion through the scene. Moreover, most of the works 
concentrate on handling the appearance changes due to
articulation and are not directly relevant to dynamic 
textures where there is no articulation. 

Prior Work on Tracking Dynamic Templates. Recently,
Peteri (2010) proposed a first method for tracking
dynamic textures using a particle filtering approach
similar to the one presented in Isard and Blake (1998).
However, this approach can best be described as a static
template tracker that uses optical flow features. The
method extracts histograms for the magnitude, direction,
divergence and curl of the optical flow field of the
dynamic texture in the first two frames. It then assumes
that the change in these flow characteristics with time
can simply be modeled using a Gaussian distribution 
with the initially computed histograms as the mean. 
The variance of this distribution is selected as a param- 
eter. Furthermore, they do not model the characteristic 
temporal dynamics of the intensity variations specific to 
each class of dynamic textures, most commonly modeled 
using LDSs. As we will also show in our experiments, 
their approach performs poorly on several real dynamic 
texture examples. 

LDS-based techniques have been shown to be ex- 
tremely valuable for dynamic texture recognition (Saisan 
et al., 2001; Doretto et al., 2003; Chan and Vasconcelos, 
2007; Ravichandran et al., 2009), synthesis (Doretto 
et al., 2003), and registration (Ravichandran and Vidal, 
2008). They have also been successfully used to model 
the temporal evolution of human actions for the purpose
of activity recognition (Bissacco et al., 2001, 2007;
Chaudhry et al., 2009). Therefore, it is only natural to 
assume that such a representation should also be useful 
for tracking. 

Finally, Vidal and Ravichandran (2005) propose a 
method to jointly compute the dynamics as well as the 
optical flow of a scene for the purpose of segmenting
moving dynamic textures. Using the Dynamic Texture
Constancy Constraint (DTCC), the authors show that 
if the motion of the texture is slow, the optical flow 
corresponding to 2-D rigid motion of the texture (or 
equivalently the motion of the camera) can be computed 
using a method similar to the Lucas-Kanade optical flow 
algorithm. In principle, this method can be extended 
to track a dynamic texture in a framework similar to 
the KLT tracker. However, the requirement of having 
a slow-moving dynamic texture is particularly strict,
especially for high-order systems and would not work in 
most cases. Moreover, the authors do not enforce any 
spatial coherence of the moving textures, which causes 
the segmentation results to have holes. 

In light of the above discussion, we posit that there 
is a need to develop a principled approach for tracking 
dynamic templates that explicitly models the character- 
istic temporal dynamics of the appearance and motion. 
As we will show, by incorporating these dynamics, our 
proposed method achieves superior tracking results and
also allows us to perform simultaneous tracking and
recognition of dynamic templates. 

2 Review of Static Template Tracking 

In this section, we will formulate the tracking problem 
as a maximum a-posteriori estimation problem and show 
how standard static template tracking methods such 
as Sum-of-Squared-Differences (SSD) and kernel-based 
tracking are special cases of this general problem. 

Assume that we are given a static template $\mathcal{I} : \Omega \to \mathbb{R}$,
centered at the origin on the discrete pixel grid,
$\Omega \subset \mathbb{R}^2$. At each time instant, $t$, we observe an image
frame, $y_t : \mathcal{F} \to \mathbb{R}$, where $\mathcal{F} \subset \mathbb{R}^2$ is the discrete pixel
domain of the image. As the template moves in the scene,
it undergoes a translation, $l_t \in \mathbb{R}^2$, from the origin of
the template reference frame $\Omega$. Moreover, assume that
due to noise in the observed image the intensity at each
pixel in $\mathcal{F}$ is corrupted by i.i.d. Gaussian noise with
mean 0 and standard deviation $\sigma_Y$. Hence, for each
pixel location $z \in \Omega + l_t = \{z' + l_t : z' \in \Omega\}$, we have,

$$y_t(z) = \mathcal{I}(z - l_t) + w_t(z), \quad \text{where } w_t(z) \sim \mathcal{N}(0, \sigma_Y^2). \qquad (1)$$

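Therefore, the likelihood of the image intensity at pixel
$z \in \Omega + l_t$ is $p(y_t(z) \mid l_t) = \mathcal{G}_{y_t(z)}(\mathcal{I}(z - l_t), \sigma_Y^2)$, where

$$\mathcal{G}_x(\mu, \Sigma) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}$$

is the n-dimensional Gaussian pdf with mean $\mu$ and
covariance $\Sigma$. Given $y_t = [y_t(z)]_{z \in \mathcal{F}}$, i.e., the stack of
all the pixel intensities in the frame at time $t$, we would
like to maximize the posterior, $p(l_t \mid y_t)$. Assuming a
uniform background, i.e., $p(y_t(z) \mid l_t) = 1/K$ if $z - l_t \notin \Omega$,
and a uniform prior for $l_t$, i.e., $p(l_t) = |\mathcal{F}|^{-1}$, we have,

$$p(l_t \mid y_t) = \frac{p(y_t \mid l_t)\, p(l_t)}{p(y_t)} \propto p(y_t \mid l_t) \propto \prod_{z \in \Omega + l_t} p(y_t(z) \mid l_t) \propto \exp\left\{ -\frac{1}{2\sigma_Y^2} \sum_{z \in \Omega + l_t} \big( y_t(z) - \mathcal{I}(z - l_t) \big)^2 \right\}. \qquad (2)$$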



The optimum value, $\hat{l}_t$, will maximize the log posterior
and after some algebraic manipulations, we get

$$\hat{l}_t = \arg\min_{l_t} \sum_{z \in \Omega + l_t} \big( y_t(z) - \mathcal{I}(z - l_t) \big)^2. \qquad (3)$$

Notice that with the change of variable, $z' = z - l_t$, we
can shift the domain from the image pixel space $\mathcal{F}$ to
the template pixel space $\Omega$, to get $\hat{l}_t = \arg\min_{l_t} O(l_t)$,
where

$$O(l_t) = \sum_{z' \in \Omega} \big( y_t(z' + l_t) - \mathcal{I}(z') \big)^2. \qquad (4)$$

Eq. (4) is the well known optimization function used
in Sum of Squared Differences (SSD)-based tracking of
static templates. The optimal solution is found either
by searching brute force over the entire image frame
or, given an initial estimate of the location, by performing
gradient descent iterations: $l_t^{i+1} = l_t^i - \gamma \nabla_{l_t} O(l_t^i)$.
Since the image intensity function, $y_t$, is non-convex, a
good initial guess, $l_t^0$, is very important for the gradient
descent algorithm to converge to the correct location.
Generally, $l_t^0$ is initialized using the optimal location
found at the previous time instant, i.e., $\hat{l}_{t-1}$, and $\hat{l}_0$ is
hand set or found using an object detector.
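
As a rough illustration of the brute-force option for Eq. (4), the following NumPy sketch evaluates the SSD objective at every in-bounds translation and keeps the minimizer. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names are hypothetical, the translation is indexed by the top-left corner of the template rather than its center, and the test frame is synthetic.

```python
import numpy as np

def ssd_cost(frame, template, top, left):
    """SSD objective of Eq. (4) with the template's top-left corner at (top, left)."""
    r, c = template.shape
    patch = frame[top:top + r, left:left + c]
    return float(np.sum((patch - template) ** 2))

def ssd_track_brute_force(frame, template):
    """Search every in-bounds translation and return the one with the lowest SSD."""
    r, c = template.shape
    rows, cols = frame.shape
    best_cost, best_loc = np.inf, (0, 0)
    for top in range(rows - r + 1):
        for left in range(cols - c + 1):
            cost = ssd_cost(frame, template, top, left)
            if cost < best_cost:
                best_cost, best_loc = cost, (top, left)
    return best_loc, best_cost

# Synthetic example (assumption): a random template pasted into a noisy frame.
rng = np.random.default_rng(0)
template = rng.random((5, 7))
frame = 0.01 * rng.standard_normal((40, 50))
frame[12:17, 20:27] += template
location, cost = ssd_track_brute_force(frame, template)
```

In practice the exhaustive search is replaced by a few gradient descent iterations started from the previous location, which is why a good initial guess matters.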

In the above description, we have assumed that we
observe the intensity values, $y_t$, directly. However, to
develop an algorithm that is invariant to nuisance factors
such as changes in contrast and brightness, object
orientations, etc., we can choose to compute the value
of a more robust function that also considers the intensity
values over a neighborhood $\Gamma(z) \subset \mathbb{R}^2$ of the pixel
location $z$,

$$f_t(z) = f\big( [y_t(z')]_{z' \in \Gamma(z)} \big), \quad f : \mathbb{R}^{|\Gamma|} \to \mathbb{R}^d, \qquad (5)$$

where $[y_t(z')]_{z' \in \Gamma(z)}$ represents the stack of intensity
values of all the pixel locations in $\Gamma(z)$. We can therefore
treat the value of $f_t(z)$ as the observed random variable
at the location $z$ instead of the actual intensities, $y_t(z)$.

Notice that even though the conditional probability
of the intensity of individual pixels is Gaussian, as in
Eq. (1), under the (possibly) non-linear transformation,
$f$, the conditional probability of $f_t(z)$ will no longer be
Gaussian. However, from an empirical point of view,
using a Gaussian assumption in general provides very
good results. Therefore, due to changes in the location
of the template, we observe $f_t(z) = f\big( [\mathcal{I}(z')]_{z' \in \Gamma(z - l_t)} \big) + w_t^f(z)$,
where $w_t^f(z) \sim \mathcal{N}(0, \sigma_f^2)$ is isotropic Gaussian
noise.

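Following the same derivation as before, the new
cost function to be optimized becomes,

$$O(l_t) = \sum_{z \in \Omega} \big\| f\big( [y_t(z')]_{z' \in \Gamma(z + l_t)} \big) - f\big( [\mathcal{I}(z')]_{z' \in \Gamma(z)} \big) \big\|^2 = \| F(y_t(l_t)) - F(\mathcal{I}) \|^2, \qquad (6)$$

where,

$$F(y_t(l_t)) = \big[ f\big( [y_t(z')]_{z' \in \Gamma(z + l_t)} \big) \big]_{z \in \Omega}, \qquad F(\mathcal{I}) = \big[ f\big( [\mathcal{I}(z')]_{z' \in \Gamma(z)} \big) \big]_{z \in \Omega}.$$

By the same argument as in Eq. (4), $\hat{l}_t = \arg\min_{l_t} O(l_t)$
also maximizes the posterior, $p(l_t \mid F(y_t))$, where

$$F(y_t) = \big[ f\big( [y_t(z')]_{z' \in \Gamma(z)} \big) \big]_{z \in \mathcal{F}} \qquad (7)$$

is the stack of all the function evaluations with neighborhood
size $\Gamma$ over all the pixels in the frame.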

For the sake of simplicity, from now on as in Eq.
(6), we will abuse the notation and use $y_t(l_t)$ to denote
the stacked vector $[y_t(z')]_{z' \in \Gamma(z + l_t)}$, and $y_t$ to denote
the full frame, $[y_t(z)]_{z \in \mathcal{F}}$. Moreover, assume that the
ordering of pixels in $\Omega$ is in column-wise format, denoted
by the set $\{1, \ldots, N\}$. Finally, if the size of the
neighborhood, $\Gamma$, is equal to the size of the template, $\mathcal{I}$,
i.e., $|\Gamma| = |\Omega|$, $f$ will only need to be computed at the
central pixel of $\Omega$, shifted by $l_t$, i.e.,

$$O(l_t) = \| f(y_t(l_t)) - f(\mathcal{I}) \|^2. \qquad (8)$$



Kernel-Based Tracking. One special class of functions
that has commonly been used in kernel-based tracking
methods (Comaniciu and Meer, 2002; Comaniciu
et al., 2003) is that of kernel-weighted histograms of
intensities. These functions have very useful properties
in that, with the right choice of the kernel, they can
be made either robust to variations in the pose of the
object, or sensitive to certain discriminatory characteristics
of the object. This property is extremely useful
in common tracking problems and is the reason for the
wide use of kernel-based tracking methods. In particular,
a kernel-weighted histogram, $\rho = [\rho_1, \ldots, \rho_B]^\top$ with $B$
bins, $u = 1, \ldots, B$, computed at pixel location $l_t$, is
defined as,

$$\rho_u(y_t(l_t)) = \frac{1}{\kappa} \sum_{z \in \Omega} K(z)\, \delta\big( b(y_t(z + l_t)) - u \big), \qquad (9)$$

where $b$ is a binning function for the intensity $y_t(z)$ at
the pixel $z$, $\delta$ is the discrete Kronecker delta function,
and $\kappa = \sum_{z \in \Omega} K(z)$ is a normalization constant such
that the sum of the histogram equals 1. One of the more
commonly used kernels is the Epanechnikov kernel,

$$K(z) = \begin{cases} 1 - \|Hz\|^2, & \|Hz\| < 1 \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

where $H = \mathrm{diag}([r^{-1}, c^{-1}])$ is the bandwidth scaling
matrix of the kernel corresponding to the size of the
template, i.e., $|\Omega| = r \times c$.

Using the fact that we observe $F = \sqrt{\rho}$, the entry-wise
square root of the histogram, we get the Matusita
metric between the kernel-weighted histogram computed
at the current location in the image frame and that of
the template:

$$O(l_t) = \big\| \sqrt{\rho(y_t(l_t))} - \sqrt{\rho(\mathcal{I})} \big\|^2. \qquad (11)$$



Hager et al. (2004) showed that the minimizer of the 
objective function in Eq. (11) is precisely the solution 
of the meanshift tracker as originally proposed by Co- 
maniciu and Meer (2002) and Comaniciu et al. (2003). 
The algorithm in Hager et al. (2004) then proceeds by 
computing the optimal It that minimizes Eq. (11) using 
a Newton-like approach. We refer the reader to Hager 
et al. (2004) for more details. Hager et al. (2004) then 
propose using multiple kernels and Fan et al. (2007) 
propose structural constraints to get unique solutions 
in difficult cases. All these formulations eventually boil 
down to the solution of a problem of the form in Eq. 
(8). 
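
To make Eqs. (9)-(11) concrete, here is a minimal NumPy sketch of a kernel-weighted histogram with an Epanechnikov kernel and the resulting Matusita cost. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, intensities are assumed to lie in [0, 1], bins are indexed from 0 instead of 1, and the two patches are random.

```python
import numpy as np

def epanechnikov_kernel(r, c):
    """K(z) = 1 - ||Hz||^2 (Eq. 10) on an r x c grid centered at the origin."""
    rows = np.arange(r) - (r - 1) / 2.0
    cols = np.arange(c) - (c - 1) / 2.0
    # H = diag(1/r, 1/c) rescales the pixel coordinates before taking the norm.
    zr, zc = np.meshgrid(rows / r, cols / c, indexing="ij")
    sq_norm = zr ** 2 + zc ** 2
    return np.where(sq_norm < 1.0, 1.0 - sq_norm, 0.0)

def kernel_histogram(patch, kernel, n_bins):
    """Kernel-weighted histogram of Eq. (9); the result sums to 1."""
    # Binning function b: uniform bins over [0, 1], clipped to valid indices.
    bins = np.clip((patch * n_bins).astype(int), 0, n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=kernel.ravel(), minlength=n_bins)
    return hist / kernel.sum()

def matusita(p, q):
    """Squared distance between entry-wise square roots, as in Eq. (11)."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(1)
K = epanechnikov_kernel(9, 11)
h_template = kernel_histogram(rng.random((9, 11)), K, n_bins=10)
h_candidate = kernel_histogram(rng.random((9, 11)), K, n_bins=10)
d_same = matusita(h_template, h_template)
d_diff = matusita(h_template, h_candidate)
```

A tracker would evaluate `matusita` between the template histogram and the histogram at each candidate location, and move toward the location with the smallest value.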

Incorporating Location Dynamics. The generative 
model in Eq. (2) assumes a uniform prior on the prob- 
ability of the location of the template at time t and 
that the location at time t is independent of the loca- 
tion at time t — 1. If applicable, we can improve the 
performance of the tracker by imposing a known motion 
model, $l_t = g(l_{t-1}) + w_t^g$, such as constant velocity or
constant acceleration. In this case, the likelihood model 
is commonly appended by. 



(12) 



From here, it is a simple exercise to see that the maximum
a-posteriori estimate of $l_t$ given all the frames,
$y_0, \ldots, y_t$, can be computed by the extended Kalman
filter or particle filters since $f$, in Eq. (8), is a function of
the image intensities and therefore a non-linear function
on the pixel domain.



3 Tracking Dynamic Templates 

In the previous section we reviewed kernel-based methods
for tracking a static template $\mathcal{I} : \Omega \to \mathbb{R}$. In this
section we propose a novel kernel-based framework for
tracking a dynamic template $\mathcal{I}_t : \Omega \to \mathbb{R}$. For ease of
exposition, we derive the framework under the assumption
that the location of the template $l_t$ is equally likely on
the image domain. For the case of a dynamic prior on
the location, the formulation will result in an extended
Kalman or particle filter as briefly mentioned at the end
of §2.

We model the temporal evolution of the dynamic
template $\mathcal{I}_t$ using Linear Dynamical Systems (LDSs).
LDSs are represented by the tuple $(\mu, A, C, B)$ and
satisfy the following equations for all time $t$:

$$x_t = A x_{t-1} + B v_t, \qquad (13)$$

$$\mathcal{I}_t = \mu + C x_t. \qquad (14)$$

Here $\mathcal{I}_t \in \mathbb{R}^{|\Omega|}$ is the stacked vector, $[\mathcal{I}_t(z)]_{z \in \Omega}$, of image
intensities of the dynamic template at time $t$, and $x_t$ is
the (hidden) state of the system at time $t$. The current
state is linearly related to the previous state by the state
transition matrix $A$ and the current output is linearly
related to the current state by the observation matrix $C$.
$v_t$ is the process noise, which is assumed to be Gaussian
and independent from $x_t$. Specifically, $B v_t \sim \mathcal{N}(0, Q)$,
where $Q = B B^\top$.
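
The generative model of Eqs. (13)-(14) can be sketched in a few lines of NumPy. This is only an illustration under assumed parameters: the function name `simulate_lds`, the damped-rotation transition matrix, the random orthonormal observation matrix, and the noise scale are all hypothetical choices, not values from the paper.

```python
import numpy as np

def simulate_lds(mu, A, C, B, x0, n_frames, rng):
    """Draw a sequence of stacked templates I_1, ..., I_T from Eqs. (13)-(14)."""
    x = np.asarray(x0, dtype=float)
    frames = []
    for _ in range(n_frames):
        v = rng.standard_normal(B.shape[1])  # process noise v_t ~ N(0, I)
        x = A @ x + B @ v                    # state update, Eq. (13)
        frames.append(mu + C @ x)            # template appearance, Eq. (14)
    return np.stack(frames, axis=1)          # shape (n_pixels, n_frames)

# Hypothetical second-order system: a damped rotation drives 20 "pixels".
rng = np.random.default_rng(0)
n_pixels, order, theta = 20, 2, 0.3
A = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
C = np.linalg.qr(rng.standard_normal((n_pixels, order)))[0]  # orthonormal columns
B = 0.1 * np.eye(order)
mu = np.full(n_pixels, 0.5)
frames = simulate_lds(mu, A, C, B, x0=np.ones(order), n_frames=50, rng=rng)
```

Each column of `frames` plays the role of one stacked template; reshaping a column to r x c would give the template image at that time instant.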

Tracking dynamic templates requires knowledge of 
the system parameters, $(\mu, A, C, B)$, for dynamic templates
of interest. Naturally, these parameters have to be
learnt from training data. Once these parameters have 
been learnt, they can be used to track the template in a 
new video. However the size, orientation, and direction 
of motion of the template in the test video might be very 
different from that of the training videos and therefore 
our procedure will need to be invariant to these changes. 
In the following, we will propose our dynamic template 
tracking framework by describing in detail each of these 
steps, 

1. Learning the system parameters of a dynamic tem- 
plate from training data, 

2. Tracking dynamic templates of the same size, orien- 
tation, and direction of motion as training data, 

3. Discussing the convergence properties and parameter 
tuning, and 

4. Incorporating invariance to size, orientation, and 
direction of motion. 



3.1 LDS Parameter Estimation 

We will first describe the procedure to learn the system
parameters, $(\mu, A, C, B)$, of a dynamic template from a
training video of that template. Assume that we can
manually, or semi-automatically, mark a bounding box
of size $|\Omega| = r \times c$ rows and columns, which best covers
the spatial extent of the dynamic template at each frame
of the training video. The center of each box gives us
the location of the template at each frame, which we use
as the ground truth template location. Next, we extract
the sequence of column-wise stacked pixel intensities,
$\mathcal{I} = [\mathcal{I}_1, \ldots, \mathcal{I}_N]$, corresponding to the appearance of the
template at each time instant in the marked bounding
box. We can then compute the system parameters using
the system identification approach proposed in Doretto
et al. (2003). Briefly, given the sequence, $\mathcal{I}$, we compute
a compact, rank $n$, singular value decomposition (SVD)
of the matrix, $\bar{\mathcal{I}} = [\mathcal{I}_1 - \mu, \ldots, \mathcal{I}_N - \mu] = U \Sigma V^\top$. Here
$\mu = \frac{1}{N} \sum_{i=1}^{N} \mathcal{I}_i$, and $n$ is the order of the system and is
a parameter. For all our experiments, we have chosen
$n = 5$. We then compute $C = U$ and the state sequence
$X_1^N = \Sigma V^\top$, where $X_{t_1}^{t_2} = [x_{t_1}, \ldots, x_{t_2}]$. Given $X_1^N$,
the matrix $A$ can be computed using least-squares as
$A = X_2^N (X_1^{N-1})^\dagger$, where $X^\dagger$ represents the pseudo-inverse
of $X$. Also, $Q = \frac{1}{N-1} \sum_{t=1}^{N-1} v'_t (v'_t)^\top$, where $v'_t =
B v_t = x_{t+1} - A x_t$. $B$ is computed using the Cholesky
factorization of $Q = B B^\top$.
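
The identification steps above can be sketched with NumPy as follows. This is a minimal sketch in the spirit of the procedure from Doretto et al. (2003) as summarized in the text, not the authors' implementation: the function name `learn_lds` is hypothetical, the small diagonal jitter added before the Cholesky factorization is an assumption made for numerical safety, and the input video is random data used only to exercise the code.

```python
import numpy as np

def learn_lds(frames, order):
    """Estimate (mu, A, C, B) from an (n_pixels, N) matrix of stacked templates."""
    mu = frames.mean(axis=1)                       # temporal mean of the template
    centered = frames - mu[:, None]
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    C = U[:, :order]                               # observation matrix
    X = s[:order, None] * Vt[:order]               # state sequence, (order, N)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])       # least-squares state transition
    resid = X[:, 1:] - A @ X[:, :-1]               # v'_t = x_{t+1} - A x_t
    Q = resid @ resid.T / resid.shape[1]           # process noise covariance
    # Assumption: tiny jitter keeps Q numerically positive definite.
    B = np.linalg.cholesky(Q + 1e-10 * np.eye(order))
    return mu, A, C, B

# Random stand-in for a 30-pixel template observed over 60 frames.
rng = np.random.default_rng(0)
video = rng.random((30, 60))
mu, A, C, B = learn_lds(video, order=5)
```

With real data, each column of the input would be the column-wise stacked intensities inside the marked bounding box of one training frame.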




Fig. 1 Illustration of the dynamic template tracking problem. 




Fig. 2 Graphical representation for the generative model of 
the observed template. 



3.2 Tracking Dynamic Templates 

Problem Formulation. We will now formulate the
problem of tracking a dynamic template of size $|\Omega| =
r \times c$, with known system parameters $(\mu, A, C, B)$. Given
a test video, at each time instant, $t$, we observe an
image frame, $y_t : \mathcal{F} \to \mathbb{R}$, obtained by translating the
template $\mathcal{I}_t$ by an amount $l_t \in \mathbb{R}^2$ from the origin of the
template reference frame $\Omega$. Previously, at time instant
$t - 1$, the template was observed at location $l_{t-1}$ in
frame $y_{t-1}$. In addition to the change in location, the
intensity of the dynamic template changes according
to Eqs. (13-14). Moreover, assume that due to noise
in the observed image the intensity at each pixel in $\mathcal{F}$
is corrupted by i.i.d. Gaussian noise with mean 0, and
standard deviation $\sigma_Y$. Therefore, the intensity at pixel
$z \in \mathcal{F}$ given the location of the template and the current
state of the dynamic texture is

$$y_t(z) = \mathcal{I}_t(z - l_t) + w_t(z) \qquad (15)$$
$$\phantom{y_t(z)} = \mu(z - l_t) + C(z - l_t)^\top x_t + w_t(z), \qquad (16)$$

where the pixel $z$ is used to index $\mu$ in Eq. (14) according
to the ordering in $\Omega$, e.g., in a column-wise fashion.
Similarly, $C(z)^\top$ is the row of the $C$ matrix in Eq. (14)
indexed by the pixel $z$. Fig. 1 illustrates the tracking
scenario and Fig. 2 shows the corresponding graphical



model representation¹. We only observe the frame, $y_t$,
and the appearance of the frame is conditional on the
location of the dynamic template, $l_t$, and its state, $x_t$.

As described in §2, rather than using the image
intensities $y_t$ as our measurements, we compute a kernel-weighted
histogram centered at each test location $l_t$,

$$\rho_u(y_t(l_t)) = \frac{1}{\kappa} \sum_{z \in \Omega} K(z)\, \delta\big( b(y_t(z + l_t)) - u \big). \qquad (17)$$

In an entirely analogous fashion, we compute a kernel-weighted
histogram of the template

$$\rho_u(\mathcal{I}_t(x_t)) = \frac{1}{\kappa} \sum_{z \in \Omega} K(z)\, \delta\big( b(\mu(z) + C(z)^\top x_t) - u \big), \qquad (18)$$

where we write $\mathcal{I}_t(x_t)$ to emphasize the dependence of
the template $\mathcal{I}_t$ on the latent state $x_t$, which needs to
be estimated together with the template location $l_t$.

¹ Notice that since we have assumed that the location of
the template at time $t$ is independent of its location at time
$t - 1$, there is no link from $l_{t-1}$ to $l_t$ in the graphical model.

Since the space of histograms is a non-Euclidean
space, we need a metric on the space of histograms to be
able to correctly compare the observed kernel-weighted
histogram with that generated by the template. One
convenient metric on the space of histograms is the
Matusita metric,

$$d\big( \rho(y_t(l_t)), \rho(\mathcal{I}_t(x_t)) \big) = \sum_{u=1}^{B} \Big( \sqrt{\rho_u(y_t(l_t))} - \sqrt{\rho_u(\mathcal{I}_t(x_t))} \Big)^2, \qquad (19)$$

which was also previously used in Eq. (11) for the static
template case by Hager et al. (2004). Therefore, we
approximate the probability of the square root of each
entry of the histogram as,

$$p\big( \sqrt{\rho_u(y_t)} \mid l_t, x_t \big) \approx \mathcal{G}_{\sqrt{\rho_u(y_t(l_t))}}\Big( \sqrt{\rho_u(\mathcal{I}_t(x_t))}, \sigma_H^2 \Big), \qquad (20)$$

where $\sigma_H^2$ is the variance of the entries of the histogram
bins. The tracking problem therefore results in the maximum
a-posteriori estimation,

$$(\hat{l}_t, \hat{x}_t) = \arg\max_{l_t, x_t} p\big( l_t, x_t \mid \sqrt{\rho(y_1)}, \ldots, \sqrt{\rho(y_t)} \big), \qquad (21)$$

where $\{\sqrt{\rho(y_i)}\}_{i=1}^{t}$ are the kernel-weighted histograms
computed at each image location in all the frames with
the square-root taken element-wise. Deriving an optimal
non-linear filter for the above problem might not be
computationally feasible and we might have to resort to
particle filtering based approaches. However, as we will
explain, we can simplify the above problem drastically
by proposing a greedy solution that, although not provably
optimal, is computationally efficient. Moreover, as
we will show in the experimental section, this solution
results in an algorithm that performs on par with, or even
better than, state-of-the-art tracking methods.

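Bayesian Filtering. Define $\mathcal{P}_t = \{\sqrt{\rho(y_1)}, \ldots, \sqrt{\rho(y_t)}\}$,
and consider the Bayesian filter derivation for Eq. (21):

$$\begin{aligned}
p(l_t, x_t \mid \mathcal{P}_t) &= p(l_t, x_t \mid \mathcal{P}_{t-1}, y_t) \\
&= \frac{p(\sqrt{\rho(y_t)} \mid l_t, x_t)\, p(l_t, x_t \mid \mathcal{P}_{t-1})}{p(\sqrt{\rho(y_t)} \mid \mathcal{P}_{t-1})} \\
&\propto p(\sqrt{\rho(y_t)} \mid l_t, x_t)\, p(l_t, x_t \mid \mathcal{P}_{t-1}) \\
&= p(\sqrt{\rho(y_t)} \mid l_t, x_t) \int p(l_t, x_t, x_{t-1} \mid \mathcal{P}_{t-1})\, dx_{t-1} \\
&= p(\sqrt{\rho(y_t)} \mid l_t, x_t) \int p(l_t, x_t \mid x_{t-1})\, p(x_{t-1} \mid \mathcal{P}_{t-1})\, dx_{t-1} \\
&= p(\sqrt{\rho(y_t)} \mid l_t, x_t)\, p(l_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid \mathcal{P}_{t-1})\, dx_{t-1},
\end{aligned}$$

where we have assumed a uniform prior $p(l_t) = |\mathcal{F}|^{-1}$.
Assuming that we have full confidence in the estimate
$\hat{x}_{t-1}$ of the state at time $t - 1$, we can use the greedy
posterior, $p(x_{t-1} \mid \mathcal{P}_{t-1}) = \delta(x_{t-1} = \hat{x}_{t-1})$, to greatly
simplify the above expression as,

$$p(l_t, x_t \mid \mathcal{P}_t) \propto p(\sqrt{\rho(y_t)} \mid l_t, x_t)\, p(x_t \mid x_{t-1} = \hat{x}_{t-1})\, p(l_t) \propto p(\sqrt{\rho(y_t)} \mid l_t, x_t)\, p(x_t \mid x_{t-1} = \hat{x}_{t-1}). \qquad (22)$$

Fig. 3 shows the graphical model corresponding to this
approximation.

Fig. 3 Graphical Model for the approximate generative model
of the observed template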

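After some algebraic manipulations, we arrive at,

$$(\hat{l}_t, \hat{x}_t) = \arg\min_{l_t, x_t} O(l_t, x_t), \qquad (23)$$

where,

$$O(l_t, x_t) = \frac{1}{2\sigma_H^2} \big\| \sqrt{\rho(y_t(l_t))} - \sqrt{\rho(\mu + C x_t)} \big\|^2 + \frac{1}{2} (x_t - A\hat{x}_{t-1})^\top Q^{-1} (x_t - A\hat{x}_{t-1}). \qquad (24)$$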
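
The algebraic manipulations leading to Eqs. (23)-(24) can be sketched as follows. The sketch assumes, as Eq. (20) suggests, that the B square-root histogram entries are treated as independent Gaussians with common variance σ_H², and that the state prior follows from Eq. (13) as a Gaussian with mean A x̂_{t-1} and covariance Q. Taking the negative logarithm of Eq. (22) and collecting the terms that do not depend on (l_t, x_t) into a constant gives:

```latex
\begin{align*}
-\log p(l_t, x_t \mid \mathcal{P}_t)
 &= -\sum_{u=1}^{B} \log p\big(\sqrt{\rho_u(y_t)} \mid l_t, x_t\big)
    - \log p(x_t \mid x_{t-1} = \hat{x}_{t-1}) + \text{const} \\
 &= \frac{1}{2\sigma_H^2} \sum_{u=1}^{B}
    \Big(\sqrt{\rho_u(y_t(l_t))} - \sqrt{\rho_u(\mu + C x_t)}\Big)^2
    + \frac{1}{2} (x_t - A\hat{x}_{t-1})^\top Q^{-1} (x_t - A\hat{x}_{t-1})
    + \text{const} \\
 &= O(l_t, x_t) + \text{const}.
\end{align*}
```

Maximizing the posterior is therefore equivalent to minimizing O, which trades off the histogram mismatch against the deviation of the state from its one-step prediction.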

Simultaneous State Estimation and Tracking. To
derive a gradient descent scheme, we need to take derivatives
of $O$ w.r.t. $x_t$. However, the histogram function, $\rho$,
is not differentiable w.r.t. $x_t$ because of the $\delta$ function
and hence we cannot compute the required derivative
of Eq. (24). Instead of $\rho$, we propose to use $\zeta$, where $\zeta$
is a continuous histogram function defined as,

$$\zeta_u(y_t(l_t)) = \frac{1}{\kappa} \sum_{z \in \Omega} K(z) \big( \phi_{u-1}(y_t(z + l_t)) - \phi_u(y_t(z + l_t)) \big), \qquad (25)$$

where $\phi_u(s) = (1 + \exp\{-\sigma (s - r(u))\})^{-1}$ is the sigmoid
function. With a suitably chosen $\sigma$ ($\sigma = 100$ in
our case), the difference of the two sigmoids is a continuous
and differentiable function that approximates a step
function on the grayscale intensity range $[r(u - 1), r(u)]$.
For example, for a histogram with $B$ bins, to uniformly
cover the grayscale intensity range from 0 to 1, we have
$r(u) = u/B$. The difference, $\phi_{u-1}(y) - \phi_u(y)$, will therefore
give a value close to 1 when the pixel intensity is
in the range $[\frac{u-1}{B}, \frac{u}{B}]$, and close to 0 otherwise, thus
contributing to the $u$-th bin.

Fig. 4 Binning function for B = 10 bins and the corresponding
non-differentiable exact binning strategy vs our proposed
differentiable but approximate binning strategy, for bin number
u = 5. (a) Binning function, b(.), for a histogram with 10 bins,
B = 10 (horizontal axis: pixel intensity). (b) Exact binning.
(c) Approximate binning.

Fig. 4(a) illustrates the
binning function $b(\cdot)$ in Eqs. (9, 17, 18) for a specific case
where the pixel intensity range [0, 1] is equally divided
into $B = 10$ bins. Fig. 4(b) shows the non-differentiable
but exact binning function, $\delta(b(\cdot) - u)$, for bin number
$u = 5$, whereas Fig. 4(c) shows the corresponding proposed
approximate but differentiable binning function,
$(\phi_{u-1} - \phi_u)(\cdot)$. The proposed function
responds with a value close to 1 for intensity values
between $r(5 - 1) = 0.4$ and $r(5) = 0.5$. The spatial
kernel weighting of the values is done in exactly the
same way as for the non-continuous case in Eq. (9).
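
The sigmoid-difference binning of Eq. (25) can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, intensities are assumed to lie in [0, 1] with bin edges r(u) = u/B, and a uniform kernel is used in the example purely for simplicity.

```python
import numpy as np

def sigmoid(s, threshold, sharpness=100.0):
    """phi_u(s) = 1 / (1 + exp(-sigma * (s - r(u)))) with r(u) = threshold."""
    return 1.0 / (1.0 + np.exp(-sharpness * (s - threshold)))

def soft_histogram(patch, kernel, n_bins, sharpness=100.0):
    """Continuous, differentiable histogram zeta of Eq. (25)."""
    values = patch.ravel()
    weights = kernel.ravel() / kernel.sum()        # K(z) / kappa
    hist = np.empty(n_bins)
    for u in range(1, n_bins + 1):                 # bins are 1-indexed as in the text
        lower = sigmoid(values, (u - 1) / n_bins, sharpness)  # phi_{u-1}
        upper = sigmoid(values, u / n_bins, sharpness)        # phi_u
        hist[u - 1] = np.sum(weights * (lower - upper))
    return hist

# A constant patch of intensity 0.45 should fall almost entirely in bin u = 5.
patch = np.full((8, 8), 0.45)
kernel = np.ones((8, 8))                           # uniform kernel (assumption)
hist = soft_histogram(patch, kernel, n_bins=10)
dominant_bin = int(np.argmax(hist)) + 1
```

Unlike the exact histogram, a small fraction of the mass leaks into the neighboring bins, which is the price paid for differentiability; a larger sharpness reduces the leakage but makes the gradients steeper.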

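We can now find the optimal location and state at each time-step by performing gradient descent on Eq. (24) with $\rho$ replaced by $\zeta$. This gives us the following iterations (see Appendix A for a detailed derivation)

$$\begin{bmatrix} \mathbf{l}_t^{i+1} \\ \mathbf{x}_t^{i+1} \end{bmatrix} = \begin{bmatrix} \mathbf{l}_t^{i} \\ \mathbf{x}_t^{i} \end{bmatrix} - 2\gamma \begin{bmatrix} L_i^\top \mathbf{a}_i \\ -M_i^\top \mathbf{a}_i + \mathbf{d}_i \end{bmatrix}, \qquad (26)$$

where,

$$\mathbf{a} = \sqrt{\zeta(\mathbf{y}_t(\mathbf{l}_t))} - \sqrt{\zeta(\mu + C\mathbf{x}_t)}, \qquad \mathbf{d} = Q^{-1}(\mathbf{x}_t - A\hat{\mathbf{x}}_{t-1}),$$

$$L = \frac{1}{2\sigma_H^2} \, \mathrm{diag}(\zeta(\mathbf{y}_t(\mathbf{l}_t)))^{-1/2} \, U^\top J_K,$$

$$M = \frac{1}{2\sigma_H^2} \, \mathrm{diag}(\zeta(\mu + C\mathbf{x}_t))^{-1/2} \, (\Phi')^\top \frac{1}{\kappa} \, \mathrm{diag}(K) \, C.$$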

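The index $i$ in Eq. (26) represents evaluation of the above quantities using the estimates $(\mathbf{l}_t^i, \mathbf{x}_t^i)$ at iteration $i$. Here, $J_K$ is the Jacobian of the kernel $K$,

$$J_K = [\nabla K(\mathbf{z}_1), \ldots, \nabla K(\mathbf{z}_N)]^\top,$$

and $U = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_B]$ is a real-valued sifting matrix (analogous to that in Hager et al. (2004)) with

$$\mathbf{u}_j = \begin{bmatrix} \phi_{j-1}(\mathbf{y}_t(\mathbf{z}_1)) - \phi_j(\mathbf{y}_t(\mathbf{z}_1)) \\ \phi_{j-1}(\mathbf{y}_t(\mathbf{z}_2)) - \phi_j(\mathbf{y}_t(\mathbf{z}_2)) \\ \vdots \\ \phi_{j-1}(\mathbf{y}_t(\mathbf{z}_N)) - \phi_j(\mathbf{y}_t(\mathbf{z}_N)) \end{bmatrix},$$

where the numbers $1, \ldots, N$ provide an index in the pixel domain of $\Omega$, as previously mentioned. $\Phi' = [\Phi'_1, \Phi'_2, \ldots, \Phi'_B] \in \mathbb{R}^{N \times B}$ is a matrix composed of derivatives of the difference of successive sigmoid functions with

$$\Phi'_j = \begin{bmatrix} (\phi'_{j-1} - \phi'_j)(\mu(\mathbf{z}_1) + C(\mathbf{z}_1)^\top \mathbf{x}_t) \\ (\phi'_{j-1} - \phi'_j)(\mu(\mathbf{z}_2) + C(\mathbf{z}_2)^\top \mathbf{x}_t) \\ \vdots \\ (\phi'_{j-1} - \phi'_j)(\mu(\mathbf{z}_N) + C(\mathbf{z}_N)^\top \mathbf{x}_t) \end{bmatrix}. \qquad (27)$$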
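To make the sifting matrix $U$ and the derivative matrix $\Phi'$ concrete, the following NumPy sketch builds both for a small set of pixel values and checks the analytic sigmoid derivative against finite differences. The function names, the uniform kernel and the sample intensities are hypothetical choices for illustration; the sketch is not the implementation used in our experiments.

```python
import numpy as np

def phi(s, u, num_bins, sigma=100.0):
    """Sigmoid phi_u(s) with threshold r(u) = u / num_bins (tanh form avoids overflow)."""
    return 0.5 * (1.0 + np.tanh(0.5 * sigma * (s - u / num_bins)))

def phi_prime(s, u, num_bins, sigma=100.0):
    """Derivative of phi_u with respect to s, i.e. sigma * phi * (1 - phi)."""
    p = phi(s, u, num_bins, sigma)
    return sigma * p * (1.0 - p)

def sifting_matrix(values, num_bins, sigma=100.0):
    """U[n, j-1] = phi_{j-1}(s_n) - phi_j(s_n); one row per pixel, one column per bin."""
    s = np.asarray(values, dtype=float)[:, None]
    j = np.arange(1, num_bins + 1)[None, :]
    return phi(s, j - 1, num_bins, sigma) - phi(s, j, num_bins, sigma)

def sifting_derivative(values, num_bins, sigma=100.0):
    """Phi'[n, j-1] = phi'_{j-1}(s_n) - phi'_j(s_n)."""
    s = np.asarray(values, dtype=float)[:, None]
    j = np.arange(1, num_bins + 1)[None, :]
    return phi_prime(s, j - 1, num_bins, sigma) - phi_prime(s, j, num_bins, sigma)

pixels = np.linspace(0.05, 0.95, 19)     # hypothetical patch intensities
kernel = np.ones_like(pixels)            # hypothetical uniform kernel K(z)
U = sifting_matrix(pixels, num_bins=10)
zeta = U.T @ kernel / kernel.sum()       # kernel-weighted soft histogram

# Central finite differences to compare against the analytic derivative.
h = 1e-6
fd = (sifting_matrix(pixels + h, 10) - sifting_matrix(pixels - h, 10)) / (2 * h)
Phi_prime = sifting_derivative(pixels, 10)
```

Each row of `U` sums to approximately 1 for intensities away from the ends of the range, so `U.T @ kernel` distributes the kernel weight of every pixel over the bins.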



Initialization. Solving Eq. (26) iteratively will simultaneously provide the location of the dynamic texture in the scene as well as the internal state of the dynamical system. However, notice that the function $O$ in Eq. (24) is not convex in the variables $\mathbf{x}_t$ and $\mathbf{l}_t$, and hence the above iterative solution can potentially converge to local minima. To alleviate this to some extent, it is possible to choose a good initialization of the state and location as $\mathbf{x}_t^0 = A\hat{\mathbf{x}}_{t-1}$ and $\mathbf{l}_t^0 = \hat{\mathbf{l}}_{t-1}$. To initialize the tracker in the first frame, we use $\mathbf{l}_0$ as the initial location marked by the user, or determined by a detector. To initialize $\hat{\mathbf{x}}_0$, we use the pseudo-inverse computation,

$$\hat{\mathbf{x}}_0 = C^\dagger(\mathbf{y}_0(\mathbf{l}_0) - \mu), \qquad (28)$$

which coincides with the maximum a-posteriori estimate of the initial state given the correct initial location and the corresponding texture at that time. A good value of the step-size $\gamma$ can be chosen using any standard step-size selection procedure (Gill et al., 1987), such as the Armijo step-size strategy.
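As a small illustration of the pseudo-inverse initialization in Eq. (28), the NumPy sketch below recovers the state of a synthetic, noise-free patch. The dimensions, the random model and the function name are hypothetical and only serve to show the computation.

```python
import numpy as np

def initial_state(C, mu, y0_patch):
    """Least-squares initial state: x0 = pinv(C) @ (y0 - mu)."""
    return np.linalg.pinv(C) @ (y0_patch - mu)

rng = np.random.default_rng(0)
num_pixels, state_dim = 64, 5                      # hypothetical sizes
C = rng.standard_normal((num_pixels, state_dim))   # hypothetical observation matrix
mu = rng.random(num_pixels)                        # hypothetical mean template
x_true = rng.standard_normal(state_dim)
y0 = mu + C @ x_true                               # noise-free patch at the correct location
x0 = initial_state(C, mu, y0)
```

When the patch is extracted at the correct location and is noise-free, `x0` matches `x_true` up to numerical precision; with noise it is the least-squares fit.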
We call the above method in its entirety Dynamic Kernel SSD Tracking (DK-SSD-T). For ease of exposition, we have considered the case of a single kernel; however, the derivation for multiple stacked and additive kernels (Hager et al., 2004) and for multiple collaborative kernels with structural constraints (Fan et al., 2007) follows similarly.

3.3 Convergence analysis and parameter selection 

Convergence of Location. We would first like to discuss the case when the template is static and can be represented completely using the mean, $\mu$. This is similar to the case when we can accurately synthesize the expected dynamic texture at a particular time instant before starting the tracker. In this case, our proposed approach is analogous to the original meanshift algorithm (Comaniciu et al., 2003) and inherits all the (local) convergence guarantees of that method.

Convergence of State. The second case, concerning the convergence of the state estimate, is more interesting. In traditional filtering approaches such as the Kalman filter, the variance of the state estimator is minimized at each time instant given new observations. However, in the case of non-linear dynamical systems, the Extended Kalman Filter (EKF) only minimizes the variance of the linearized state estimate and not the actual state. Particle filters such as condensation (Isard and Blake, 1998) usually have asymptotic convergence guarantees with respect to the number of particles and the number of time instants. Moreover, efficient resampling is needed to deal with cases where all but one particle have near-zero probabilities. Our greedy cost function, on the other hand, aims to maximize the posterior probability of the state estimate at each time instant by assuming that the previous state is estimated correctly. This might seem like a strong assumption, but as our experiments in §5 will show, with the initialization techniques described earlier, the state estimate converged to the correct state in all our experiments.

Parameter Tuning. The variance of the values of individual histogram bins, $\sigma_H^2$, could be empirically computed by using the EM algorithm, given kernel-weighted histograms extracted from training data. However, we fixed the value at $\sigma_H^2 = 0.01$ for all our experiments, and this choice consistently gives good tracking performance. The noise parameters, $\sigma_H^2$ and $Q$, can also be analyzed as determining the relative weights of the two terms in the cost function in Eq. (24). The first term in the cost function can be interpreted as a reconstruction term that computes the difference between the observed kernel-weighted histogram and the predicted kernel-weighted histogram given the state of the system. The second term can similarly be interpreted as a dynamics term that computes the difference between the current state and the predicted state given the previous state of the system, regularized by the state-noise covariance. Therefore, the values of $\sigma_H^2$ and $Q$ implicitly affect the relative importance of the reconstruction term and the dynamics term in the tracking formulation. As $Q$ is computed during the system identification stage, we do not control the value of this parameter. In fact, if the noise covariance of a particular training system is large, thereby implying less robust dynamic parameters, the tracker will automatically give a low weight to the dynamics term and a higher one to the reconstruction term.
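The effect of $\sigma_H^2$ and $Q$ on the two terms can be seen in a short numerical sketch. The two-term form below is an assumption based on the description above (a squared difference of square-rooted histograms scaled by $1/(2\sigma_H^2)$, plus a $Q^{-1}$-weighted state prediction error); the function name and all numbers are hypothetical.

```python
import numpy as np

def cost_terms(hist_obs, hist_pred, x, x_pred, sigma_h_sq, Q):
    """Return (reconstruction term, dynamics term) of the assumed two-term cost."""
    a = np.sqrt(hist_obs) - np.sqrt(hist_pred)
    reconstruction = float(a @ a) / (2.0 * sigma_h_sq)
    e = x - x_pred
    dynamics = 0.5 * float(e @ np.linalg.solve(Q, e))
    return reconstruction, dynamics

hist_obs = np.array([0.5, 0.3, 0.2])     # hypothetical observed histogram
hist_pred = np.array([0.4, 0.4, 0.2])    # hypothetical predicted histogram
x = np.array([1.0, -0.5])                # current state
x_pred = np.array([0.8, -0.2])           # prediction A @ x_prev
Q = 0.05 * np.eye(2)

rec, dyn = cost_terms(hist_obs, hist_pred, x, x_pred, 0.01, Q)
rec_noisy, dyn_noisy = cost_terms(hist_obs, hist_pred, x, x_pred, 0.01, 10.0 * Q)
```

Scaling $Q$ by a factor of 10 divides the dynamics term by 10 while leaving the reconstruction term unchanged, so a model with a larger state-noise covariance relies more on the observed histogram.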

3.4 Invariance to Scale, Orientation and Direction of 
Motion 

As described at the start of this section, the spatial size, $|\Omega| = r \times c$, of the dynamic template in the training video need not be the same as the size, $|\Omega'| = r' \times c'$, of the template in the test video. Moreover, while tracking, the size of the tracked patch could change from one time instant to the next. For simplicity, we have only considered the case where the size of the patch in the test video stays constant throughout the video. However, to account for a changing patch size, a dynamic model (e.g., a random walk) for $|\Omega'_t| = r'_t \times c'_t$ can easily be incorporated in the derivation of the optimization function in Eq. (24). Furthermore, certain objects such as flags or human actions have a specific direction of motion, and the direction of motion in the training video need not be the same as that in the test video.

To make the tracking procedure of a learnt dynamic 
object invariant to the size of the selected patch, or 
the direction of motion, two strategies could be chosen. 
The first approach is to find a non-linear dynamical 
systems based representation for dynamic objects that 
is by design size and pose-invariant, e.g., histograms. 
This would however pose additional challenges in the 
gradient descent scheme introduced above and would 
lead to increased computational complexity. The second 
approach is to use the proposed LDS-based represen- 
tation for dynamic objects but transform the system 
parameters according to the observed size, orientation 
and direction of motion. We propose to use the second approach and transform the system parameters, $\mu$ and $C$, as required.

Transforming the system parameters of a dynamic texture to model the transformation of the actual dynamic texture was first proposed by Ravichandran and Vidal (2011), where it was noted that two videos of the same dynamic texture taken from different viewpoints could be registered by computing the transformation between the system parameters learnt individually from each video. To remove the basis ambiguity^, we first follow Ravichandran and Vidal (2011) and convert all system parameters to the Jordan Canonical Form (JCF). If the system parameters of the training video are $\mathcal{M} = (\mu, A, C, B)$, then $\mu \in \mathbb{R}^{rc}$ is in fact the stacked mean template image, $\mu^{im} \in \mathbb{R}^{r \times c}$. Similarly, we can imagine the matrix $C = [C_1, C_2, \ldots, C_n] \in \mathbb{R}^{rc \times n}$ as a composition of basis images $C_i^{im} \in \mathbb{R}^{r \times c}$, $i \in \{1, \ldots, n\}$.

Given an initialized bounding box around the test patch, we transform the observation parameters $\mu$, $C$ learnt during the training stage to the dimension of the test patch. This is achieved by computing $(\mu')^{im} = \mu^{im}(T(x))$ and $(C'_i)^{im} = C_i^{im}(T(x))$, $i \in 1, \ldots, n$, where $T(x)$ is the corresponding transformation on the image domain. For scaling, this transformation is simply an appropriate scaling of the mean image, $\mu^{im}$, and the basis images, $C_i^{im}$, from $r \times c$ to $r' \times c'$ images using bilinear interpolation. Since the dynamics of textures of the same type are assumed to be the same, we only need to transform $\mu$ and $C$. The remaining system parameters, $A$, $B$, and $\sigma_H^2$, stay the same. For other transformations, such as changes in direction of motion, the corresponding transformation $T(x)$ can be applied to the learnt system parameters, $\mu$ and $C$, before tracking. In particular, for human actions, if the change in the direction of motion is simply from left-to-right to right-to-left, $\mu^{im}$ and $C_i^{im}$ only need to be reflected across the vertical axis to get the transformed system parameters for the test video.
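The transformation of $\mu$ and $C$ can be sketched in NumPy as follows. The helper names are hypothetical, the images are assumed to be stacked in row-major order, and the bilinear resize uses corner-aligned sampling, which is only one of several possible conventions.

```python
import numpy as np

def resize_bilinear(img, new_rows, new_cols):
    """Resize a 2-D array with bilinear interpolation (corner-aligned sampling)."""
    rows, cols = img.shape
    r = np.linspace(0.0, rows - 1.0, new_rows)
    c = np.linspace(0.0, cols - 1.0, new_cols)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, rows - 1)
    c1 = np.minimum(c0 + 1, cols - 1)
    wr = (r - r0)[:, None]
    wc = (c - c0)[None, :]
    top = img[np.ix_(r0, c0)] * (1 - wc) + img[np.ix_(r0, c1)] * wc
    bottom = img[np.ix_(r1, c0)] * (1 - wc) + img[np.ix_(r1, c1)] * wc
    return top * (1 - wr) + bottom * wr

def transform_observation_params(mu, C, shape, new_shape=None, flip_lr=False):
    """Rescale and/or reflect the mean image and every basis image (column of C)."""
    def warp(vec):
        img = vec.reshape(shape)          # assumes row-major stacking
        if new_shape is not None:
            img = resize_bilinear(img, *new_shape)
        if flip_lr:
            img = img[:, ::-1]            # reflection across the vertical axis
        return img.reshape(-1)
    mu_new = warp(mu)
    C_new = np.stack([warp(C[:, k]) for k in range(C.shape[1])], axis=1)
    return mu_new, C_new

shape = (4, 6)                            # hypothetical training patch size r x c
mu = (6.0 * np.arange(4)[:, None] + np.arange(6)[None, :]).reshape(-1)
C = np.random.default_rng(0).standard_normal((24, 3))
mu_big, C_big = transform_observation_params(mu, C, shape, new_shape=(8, 12))
mu_flip, C_flip = transform_observation_params(mu, C, shape, flip_lr=True)
```

Only $\mu$ and $C$ are touched; $A$, $B$ and $\sigma_H^2$ are reused as learnt, in line with the assumption that the dynamics of textures of the same type are the same.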

A Note on Discriminative Methods. In the previous development, we have only considered foreground feature statistics. Some state-of-the-art methods also use background feature statistics and adapt the tracking framework according to changes in both foreground and background. For example, Collins et al. (2005a) compute discriminative features such as foreground-to-background feature histogram ratios, variance ratios, and peak difference, followed by Meanshift tracking for better performance. Methods based on tracking using classifiers (Grabner et al., 2006; Babenko et al., 2009) also build features that best discriminate between foreground and background. Our framework can be easily adapted to such a setting to provide even better performance. We will leave this as future work, as our proposed method, based only on foreground statistics, already provides results similar to or better than the state-of-the-art.

^ The time series, $\{\mathbf{y}_t\}_{t=1}^T$, can be generated by the system parameters, $(\mu, A, C, B)$, and the corresponding state sequence $\{\mathbf{x}_t\}_{t=1}^T$, or by system parameters $(\mu, PAP^{-1}, CP^{-1}, PB)$, and the state sequence, $\{P\mathbf{x}_t\}_{t=1}^T$. This inherent non-uniqueness of the system parameters given only the observed sequence is referred to as the basis ambiguity.
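The basis ambiguity described in the footnote can be checked numerically. The sketch below assumes the standard LDS form $\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{v}_t$, $\mathbf{y}_t = \mu + C\mathbf{x}_t$, and uses hypothetical dimensions and an arbitrary invertible matrix $P$.

```python
import numpy as np

def simulate(mu, A, C, B, x0, noise):
    """Outputs y_t = mu + C x_t with x_{t+1} = A x_t + B v_t for a fixed noise sequence."""
    x, outputs = x0, []
    for v in noise:
        outputs.append(mu + C @ x)
        x = A @ x + B @ v
    return np.array(outputs)

rng = np.random.default_rng(1)
n, p, T = 3, 8, 20                                      # hypothetical sizes
A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # stable state matrix
C = rng.standard_normal((p, n))
B = rng.standard_normal((n, 2))
mu = rng.random(p)
x0 = rng.standard_normal(n)
noise = rng.standard_normal((T, 2))

P = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])                         # invertible change of basis (det = 7)
P_inv = np.linalg.inv(P)

y_original = simulate(mu, A, C, B, x0, noise)
y_changed = simulate(mu, P @ A @ P_inv, C @ P_inv, P @ B, P @ x0, noise)
```

Both parameter sets produce the same output sequence, which is why the parameters are brought to a canonical form before they are compared or transformed.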



4 Tracking and Recognition of Dynamic 
Templates 

The proposed generative approach presented in §3 has 
another advantage. As we will describe in detail in this 
section, we can use the value of the objective function 
in Eq. (24) to perform simultaneous tracking and recog- 
nition of dynamic templates. Moreover, we can learn 
the LDS parameters of the tracked dynamic template 
from the corresponding bounding boxes and compute 
a system-theoretic distance to all the LDS parameters 
from the training set. This distance can then be used as 
a discriminative cost to simultaneously provide the best 
tracks and class label in a test video. In the following 
we propose three different approaches for tracking and 
recognition of dynamic templates using the tracking 
approach presented in the previous section at their core. 

Recognition using tracking objective function. The dynamic template tracker computes the optimal location and state estimate at each time instant by minimizing the objective function in Eq. (24) given the system parameters, $\mathcal{M} = (\mu, A, C, B)$, of the dynamic template, and the kernel histogram variance, $\sigma_H^2$. From here on, we will use the more general expression, $\mathcal{M} = (\mu, A, C, Q, R)$, to describe all the tracker parameters, where $Q = BB^\top$ is the covariance of the state noise process and $R$ is the covariance of the observed image function, e.g., in our proposed kernel-based framework, $R = \sigma_H^2 I$. Given system parameters for multiple dynamic templates, for example, multiple sequences of the same dynamic texture, or sequences belonging to different classes, we can track a dynamic texture in a test video using each of the system parameters. For each tracking experiment, we will obtain location and state estimates for each time instant as well as the value of the objective function at the computed minima. At each time instant, the objective function value gives the negative logarithm of the posterior probability of the location and state given the system parameters. We can therefore compute the average value of the objective function across all frames and use this dynamic template reconstruction cost as a measure of how close the observed dynamic template tracks are to the model of the dynamic template used to track it.

More formally, given a set of training system parameters, $\{\mathcal{M}_i\}_{i=1}^N$, corresponding to a set of training videos with dynamic templates with class labels, $\{\mathcal{L}_i\}_{i=1}^N$, and test sequence $j \neq i$, we compute the optimal tracks and states, $\{(\hat{\mathbf{l}}_t^{(j,i)}, \hat{\mathbf{x}}_t^{(j,i)})\}_{t=1}^{T_j}$, for all $i \in 1, \ldots, N$, $j \neq i$, and the corresponding objective function values,

$$\bar{O}_j^i = \frac{1}{T_j} \sum_{t=1}^{T_j} O_j^i(\hat{\mathbf{l}}_t^{(j,i)}, \hat{\mathbf{x}}_t^{(j,i)}), \qquad (29)$$

where $O_j^i(\hat{\mathbf{l}}_t^{(j,i)}, \hat{\mathbf{x}}_t^{(j,i)})$ represents the value of the objective function in Eq. (24) computed at the optimal $\hat{\mathbf{l}}_t^{(j,i)}$ and $\hat{\mathbf{x}}_t^{(j,i)}$, when using the system parameters $\mathcal{M}_i = (\mu_i, A_i, C_i, Q_i, R_i)$ corresponding to training sequence $i$ and tracking the template in sequence $j \neq i$. The value of the objective function represents the dynamic template reconstruction cost for the test sequence, $j$, at the computed locations $\{\hat{\mathbf{l}}_t^{(j,i)}\}_{t=1}^{T_j}$ as modeled by the dynamical system $\mathcal{M}_i$. System parameters that correspond to the dynamics of the same class as the observed template should therefore give the smallest objective function value, whereas those that correspond to a different class should give a greater objective function value. Therefore, we can also use the value of the objective function as a classifier to simultaneously determine the class of the dynamic template as well as its tracks as it moves in a scene. The dynamic template class label is hence found as $\mathcal{L}_j = \mathcal{L}_k$, where $k = \arg\min_i \bar{O}_j^i$, i.e., the label of the training sequence with the minimum objective function value. The corresponding tracks $\{\hat{\mathbf{l}}_t^{(j,k)}\}_{t=1}^{T_j}$ are used as the final tracking result. Our tracking framework therefore allows us to perform simultaneous tracking and recognition of dynamic objects. We call this method for simultaneous tracking and recognition using the objective function value Dynamic Kernel SSD - Tracking and Recognition using Reconstruction (DK-SSD-TR-R).
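The decision rule of DK-SSD-TR-R reduces to averaging the per-frame objective values for each training model and taking the minimum. A minimal Python sketch follows; the function name, the objective values and the class labels are hypothetical placeholders, not results from our experiments.

```python
def recognize_by_reconstruction(objective_values, labels):
    """Pick the training model with the smallest mean per-frame objective value.

    objective_values[i] holds the objective at the tracked optimum of every
    frame of the test video when tracking with training model i.
    """
    mean_costs = [sum(values) / len(values) for values in objective_values]
    k = min(range(len(mean_costs)), key=lambda i: mean_costs[i])
    return labels[k], k, mean_costs

# Hypothetical per-frame objective values for three training models.
objective_values = [
    [4.1, 3.9, 4.3],
    [1.2, 1.0, 1.1],
    [2.5, 2.7, 2.6],
]
labels = ["steam", "fire", "water"]
label, k, mean_costs = recognize_by_reconstruction(objective_values, labels)
```

The tracks obtained with model `k` are then reported as the final tracking result for the test video.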

Tracking then recognizing. In a more traditional dynamic template recognition framework, it is assumed that the optimal tracks, $\{\hat{\mathbf{l}}_t^{j}\}_{t=1}^{T_j}$, for the dynamic template have already been extracted from the test video. Corresponding to these tracks, we can extract the sequence of bounding boxes, $Y_j = \{\mathbf{y}_t(\hat{\mathbf{l}}_t^{j})\}_{t=1}^{T_j}$, and learn the system parameters, $\mathcal{M}_j$, for the tracked dynamic template using the approach described in §3.1. We can then compute a distance between the test dynamical system, $\mathcal{M}_j$, and all the training dynamical systems, $\{\mathcal{M}_i\}$, $i = 1, \ldots, N$, $j \neq i$. A commonly used distance between two linear dynamical systems is the Martin distance (Cock and Moor, 2002), which is based on the subspace angles between the observability subspaces of the two systems. The Martin distance has been shown, e.g., in (Chaudhry and Vidal, 2009; Doretto et al., 2003; Bissacco et al., 2001), to be discriminative between dynamical systems belonging to several different classes. We can therefore use the Martin distance with Nearest Neighbors as a classifier to recognize the test dynamic template by using the optimal tracks, $\{\hat{\mathbf{l}}_t^{(j,k)}\}_{t=1}^{T_j}$, from the first method, DK-SSD-TR-R. We call this tracking then recognizing method Dynamic Kernel SSD - Tracking then Recognizing (DK-SSD-T+R).
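As an illustration of a subspace-angle distance between two LDSs, the NumPy sketch below uses a finite-horizon approximation: it stacks $C, CA, \ldots, CA^{m-1}$, orthonormalizes the result, and computes $d^2 = -\ln \prod_i \cos^2 \theta_i$ from the principal angles $\theta_i$. The truncation to a finite horizon, the function names and the random test systems are assumptions made for this sketch; the exact Martin distance is defined on the infinite observability subspaces.

```python
import numpy as np

def observability_basis(A, C, horizon):
    """Orthonormal basis of the finite-horizon observability matrix [C; CA; ...]."""
    blocks, block = [], C
    for _ in range(horizon):
        blocks.append(block)
        block = block @ A
    basis, _ = np.linalg.qr(np.vstack(blocks))
    return basis

def martin_distance(A1, C1, A2, C2, horizon=20):
    """Approximate Martin distance from the principal angles between the two bases."""
    Q1 = observability_basis(A1, C1, horizon)
    Q2 = observability_basis(A2, C2, horizon)
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cosines = np.clip(cosines, 1e-12, 1.0)       # guard against log(0) and round-off
    return float(np.sqrt(-2.0 * np.sum(np.log(cosines))))

rng = np.random.default_rng(0)
n, p = 3, 10                                     # hypothetical state and output sizes
A1 = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]
C1 = rng.standard_normal((p, n))
A2 = 0.5 * np.linalg.qr(rng.standard_normal((n, n)))[0]
C2 = rng.standard_normal((p, n))

P = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])                  # invertible change of basis
P_inv = np.linalg.inv(P)

d_same = martin_distance(A1, C1, P @ A1 @ P_inv, C1 @ P_inv)
d_other = martin_distance(A1, C1, A2, C2)
```

The distance between a system and a basis-transformed copy of itself is numerically zero, whereas two unrelated systems are separated, which is the property a nearest-neighbor classifier relies on.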

Recognition using LDS distance classifier. As we 

will show in the experiments, the reconstruction cost 
based tracking and recognition scheme, DK-SSD-TR-R, 
works very well when the number of classes is small. How- 
ever, as the number of classes increases, the classification 
accuracy decreases. The objective function value itself is in fact not a very good classifier when there are many classes with high inter-class similarity. Moreover, the tracking
then recognizing scheme, DK-SSD-T+R, disconnects 
the tracking component from the recognition part. It 
is possible that tracks that are slightly less optimal ac- 
cording to the objective function criterion may in fact 
be better for classification. To address this limitation, 
we propose to add a classification cost to our original 
objective function and use a two-step procedure that 
computes a distance between the dynamical template 
as observed through the tracked locations in the test 
video and the actual training dynamic template. This 
is motivated by the fact that if the tracked locations 
in the test video are correct, and a dynamical-systems 
based distance between the observed template in the 
test video and the training template is small, then it is 
highly likely that the tracked dynamic texture in the 
test video belongs to the same class as the training video. 
Minimizing a reconstruction and classification cost will 
allow us to simultaneously find the best tracks and the 
corresponding label of the dynamic template. 

Assume that with our proposed gradient descent scheme in Eq. (26), we have computed the optimal tracks and state estimates, $\{(\hat{\mathbf{l}}_t^{(j,i)}, \hat{\mathbf{x}}_t^{(j,i)})\}_{t=1}^{T_j}$, for all frames in test video $j$, using the system parameters corresponding to training dynamic template $\mathcal{M}_i$. As described above, we can then extract the corresponding tracked regions, $Y_j^i = \{\mathbf{y}_t(\hat{\mathbf{l}}_t^{(j,i)})\}_{t=1}^{T_j}$, and learn the system parameters $\mathcal{M}_j^i = (\mu_j^i, A_j^i, C_j^i, Q_j^i)$ using the system identification method in §3.1. If the dynamic template in the observed tracks, $\mathcal{M}_j^i$, is similar to the training dynamic template, $\mathcal{M}_i$, then the distance between $\mathcal{M}_j^i$ and $\mathcal{M}_i$ should be small. Denoting the Martin distance between two dynamical systems, $\mathcal{M}_1$ and $\mathcal{M}_2$, as $d_M(\mathcal{M}_1, \mathcal{M}_2)$, we propose to use the classification cost,

$$\mathcal{C}_j^i = d_M(\mathcal{M}_j^i, \mathcal{M}_i). \qquad (30)$$

Specifically, we classify the test video as $\mathcal{L}_j = \mathcal{L}_k$, where $k = \arg\min_i \mathcal{C}_j^i$, and use the extracted tracks, $\{\hat{\mathbf{l}}_t^{(j,k)}\}_{t=1}^{T_j}$, as the final tracks. We call this method for simultaneous tracking and recognition using a classification cost Dynamic Kernel SSD - Tracking and Recognition using Classifier (DK-SSD-TR-C). As we will show in our experiments, DK-SSD-TR-C gives state-of-the-art results for tracking and recognition of dynamic templates.

Fig. 5 Median state estimation error across 10 randomly generated dynamic textures with the initial state computed using the pseudo-inverse method in Eq. (28), using our proposed DK-SSD-T (blue), Extended Kalman Filter (EKF) (green), and Condensation Particle Filter (PF) (red). The horizontal dotted lines represent 1, 2, and 3 standard deviations for the norm of the state noise.

Fig. 6 Median state error for 10 random initializations of the initial state when estimating the state of the same dynamic texture, using our proposed DK-SSD-T (blue), Extended Kalman Filter (EKF) (green), and Condensation Particle Filter (PF) (red). The horizontal dotted lines represent 1, 2, and 3 standard deviations for the norm of the state noise.

Fig. 7 Mean and 1-standard deviation of state estimation errors for different algorithms across 10 randomly generated dynamic textures with the initial state computed using the pseudo-inverse method in Eq. (28). These figures correspond to the median results shown in Fig. 5. Panels: (a) DK-SSD-T; (b) Extended Kalman Filter; (c) Particle Filter.

Fig. 8 Mean and 1-standard deviation of state estimation errors for different algorithms across 10 random initializations of the initial state when estimating the state of the same dynamic texture. These figures correspond to the median results shown in Fig. 6.

5 Empirical evaluation of state convergence 

In §3.3, we discussed the convergence properties of our 
algorithm. Specifically, we noted that if the template 
was static or the true output of the dynamic template 
were known, our proposed algorithm is equivalent to the 
standard meanshift tracking algorithm. In this section, 
we will numerically evaluate the convergence of the state 
estimate of the dynamic template. 

We generate a random synthetic dynamic texture with known system parameters and states at each time instant. We then fix the location of the texture and assume it is known a priori, thereby reducing the problem to only the estimation of the state given correct measurements. This is also the common scenario for state estimation in control theory. Fig. 5 shows the median error in state estimation for 10 randomly generated dynamic textures using the initial state computation method in Eq. (28) in each case. For each of the systems, we estimated the state using our proposed method, Dynamic Kernel SSD Tracking (DK-SSD-T), shown in blue, the Extended Kalman Filter (EKF), shown in green, and the Condensation Particle Filter (PF) with 100 particles, shown in red, using the same initial state. Since the state, $\mathbf{x}_t$, is driven by stochastic inputs with covariance $Q$, we also display horizontal bars depicting 1, 2, and 3 standard deviations of the norm of the noise process to measure the accuracy of the estimate. As we can see, at all time instants, the state estimation error using our method remains within 1 to 2 standard deviations of the state noise. The error for both EKF and PF, on the other hand, increases with time and becomes much larger than 3 standard deviations of the noise process.
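A synthetic sequence with known states, of the kind used in this evaluation, can be generated along the following lines. This is a sketch under our own assumptions (a randomly drawn stable $A$, an orthonormal $C$, noise-free observations, hypothetical sizes), and the recovery step is only a sanity check on the generated data, not one of the estimators compared in Fig. 5.

```python
import numpy as np

def random_stable_lds(state_dim, obs_dim, rng, spectral_radius=0.95):
    """Draw random LDS parameters with the eigenvalues of A inside the unit circle."""
    A = rng.standard_normal((state_dim, state_dim))
    A *= spectral_radius / np.max(np.abs(np.linalg.eigvals(A)))
    C, _ = np.linalg.qr(rng.standard_normal((obs_dim, state_dim)))  # orthonormal columns
    B = 0.1 * rng.standard_normal((state_dim, state_dim))
    mu = rng.random(obs_dim)
    return mu, A, C, B

def simulate(mu, A, C, B, x0, num_frames, rng):
    """Return the true states (T, n) and the noise-free outputs (T, p)."""
    states, outputs, x = [], [], x0
    for _ in range(num_frames):
        states.append(x)
        outputs.append(mu + C @ x)
        x = A @ x + B @ rng.standard_normal(B.shape[1])
    return np.array(states), np.array(outputs)

def state_error(x_est, x_true):
    """Per-frame Euclidean norm of the state estimation error."""
    return np.linalg.norm(x_est - x_true, axis=1)

rng = np.random.default_rng(0)
mu, A, C, B = random_stable_lds(5, 100, rng)
X, Y = simulate(mu, A, C, B, rng.standard_normal(5), 50, rng)

# Sanity check: with orthonormal C and no observation noise, C^T (y - mu) recovers x.
X_hat = (Y - mu) @ C
err = state_error(X_hat, X)
```

Repeating the simulation for several random systems and taking `np.median` of the per-frame errors across runs gives curves of the type reported in Fig. 5.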

Figs. 7(a)-7(c) show the mean and standard deviation of the state estimation errors across all ten runs for DK-SSD-T, EKF and PF, respectively. As we can see, our method has a very small standard deviation, and thus all runs converge to within 1 to 2 standard deviations of the noise process norm. EKF and PF, on the other hand, not only diverge from the true state, but the variance in their state estimates also increases with time, thereby making the state estimates very unreliable. This is because our method uses a gradient descent scheme with several iterations to look for the (local) minimizer of the exact objective function, whereas the EKF only uses a linear approximation to the system equations at the current state and does not refine the state estimate any further at each time-step. With a finite number of samples, PF also fails to converge. This leads to a much larger error in the EKF and PF. The trade-off for our method is its computational complexity. Because of its iterative nature, our algorithm is computationally more expensive than EKF and PF. On average, our algorithm requires between 25 and 50 iterations to converge to a state estimate.

Similar to the above evaluation, Fig. 6 shows the error in state estimation for 10 different random initializations of the initial state, $\mathbf{x}_0$, for one specific dynamic texture. As we can see, the norm of the state error is very large initially, but for our proposed method, as time proceeds, the state error converges to below the state noise error. However, the state error for EKF and PF remains very large. Figs. 8(a)-8(c) show the mean and standard deviation bars for the state estimation across all 10 runs. Again, our method converges for all 10 runs, whereas the variance of the state error is very large for both EKF and PF. These two experiments show that choosing the initial state using the pseudo-inverse method results in very good state estimates. Moreover, our approach is robust to incorrect state initializations and will eventually converge to the correct state to within 2 standard deviations of the state noise.

In summary, the above evaluation shows that even 
though our method is only guaranteed to converge to 
a local minimum when estimating the internal state of 
the system, in practice, it performs very well and always 
converges to an error within two standard deviations 
of the state noise. Moreover, our method is robust to 
incorrect state initializations. Finally, since our method 
iteratively refines the state estimate at each time instant, 
it performs much better than traditional state estimation 
techniques such as the EKF and PF. 

6 Experiments on Tracking Dynamic Textures 

We will now test our proposed algorithm on several synthetic and real videos with moving dynamic textures and demonstrate the tracking performance of our algorithm against the state-of-the-art. The full videos of the corresponding results can be found at http://www.cis.jhu.edu/~rizwanch/dynamicTrackingIJCV11 using the figure numbers in this paper.

We will also compare our proposed dynamic template tracking framework against traditional kernel-based tracking methods such as Meanshift (Comaniciu et al., 2003), as well as the improvements suggested in Collins et al. (2005a) that use features such as histogram ratio and variance ratio of the foreground versus the background before applying the standard Meanshift algorithm. We use the publicly available VIVID Tracking Testbed^ (Collins et al., 2005b) for implementations of these algorithms. We also compare our method against the publicly available^ Online Boosting algorithm first proposed in Grabner et al. (2006). As mentioned in the introduction, the approach presented in Peteri (2010) also addresses dynamic texture tracking using optical flow methods. Since the authors of Peteri (2010) were not able to provide their code, we implemented their method ourselves to perform a comparison. We would like to point out that despite taking a lot of care during the implementation, and getting in touch with the authors several times, we were not able to get the same results as those shown in Peteri (2010). However, we are confident that our implementation is correct and, apart from specific parameter choices, accurately follows the approach proposed in Peteri (2010). Finally, for a fair comparison between the algorithms, we did not use color features, and we were able to get very good tracks without using any color information.

^ VIVID Tracking Testbed: http://vision.cse.psu.edu/data/vividEval/

For consistency, tracks for Online Boosting (Boost) are shown in magenta, Template Matching (TM) in yellow, Meanshift (MS) in black, and Meanshift with Variance Ratio (MS-VR) and Histogram Ratio (MS-HR) in blue and red, respectively. Tracks for Particle Filtering for Dynamic Textures (DT-PF) are shown in light brown, and the optimal tracks for our method, Dynamic Kernel SSD Tracking (DK-SSD-T), are shown in cyan, whereas any ground truth is shown in green. To better illustrate the difference in the tracks, we have zoomed in to the active portion of the video.

6.1 Tracking Synthetic Dynamic Textures 

To compare our algorithm against the state-of-the-art on dynamic data with ground-truth, we first create synthetic dynamic texture sequences by manually placing one dynamic texture patch on another dynamic texture. We use sequences from the DynTex database (Peteri et al., 2010) for this purpose. The dynamics of the foreground patch are learnt offline using the method for identifying the parameters, $(\mu, A, C, B, R)$, in Doretto et al. (2003). These are then used in our tracking framework.

In Fig. 9, the dynamic texture is a video of steam rendered over a video of water. We see that Boost, DT-PF, and MS-HR eventually lose track of the dynamic patch. The other methods stay close to the patch; however, our proposed method stays closest to the ground truth till the very end. In Fig. 10, the dynamic texture is a sequence of water rendered over a different sequence of water with different dynamics. Here again, Boost and TM lose track. DT-PF stays close to the patch initially but then diverges significantly. The other trackers manage to stay close to the dynamic patch, and our proposed tracker (cyan) performs on par with the best. Fig. 11 also shows the pixel location error at each frame for all the trackers, and Table 1 provides the mean error and standard deviation for the whole video. Overall, MS-VR and our method seem to be the best, although MS-VR has a lower standard deviation in both cases. However, note that our method achieves similar performance without the use of background information, whereas MS-VR and MS-HR use background information to build more discriminative features. Due to the dynamic changes in background appearance, Boost fails to track in both cases. Even though DT-PF is designed for dynamic textures, upon inspection, all the particles generated turn out to have the same (low) probability. Therefore, the tracker diverges. Given the fact that our method is only based on the appearance statistics of the foreground patch, as opposed to the adaptively changing foreground/background model of MS-VR, we attribute the comparable performance (especially against other foreground-only methods) to the explicit inclusion of foreground dynamics.

^ Online Boosting implementation: http://www.vision.ee.ethz.ch/boostingTrackers/index.htm

Table 1 Mean pixel error with standard deviation between tracked location and ground truth.

Algorithm    Fig. 9       Fig. 10
Boost        389 ± 149    82 ± 44
TM           50 ± 13      78 ± 28
MS           12 ± 10      10 ± 8.6
MS-VR        9.2 ± 4.3    3.2 ± 2.0
MS-HR        258 ± 174    4.6 ± 2.3
DT-PF        550 ± 474    635 ± 652
DK-SSD-T     8.6 ± 6.8    6.5 ± 6.6
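The error statistics in Table 1 are of the following kind: the per-frame Euclidean distance between the tracked location and the ground truth, summarized by its mean and standard deviation. The Python sketch below uses hypothetical track coordinates and assumes the population standard deviation; the text does not specify which estimator was used.

```python
import math
from statistics import mean, pstdev

def pixel_errors(tracked, ground_truth):
    """Per-frame Euclidean distance (in pixels) between tracked and true locations."""
    return [math.hypot(tx - gx, ty - gy)
            for (tx, ty), (gx, gy) in zip(tracked, ground_truth)]

# Hypothetical tracks for three frames, given as (x, y) pixel coordinates.
tracked = [(10, 10), (13, 14), (20, 20)]
ground_truth = [(10, 10), (10, 10), (20, 23)]

errors = pixel_errors(tracked, ground_truth)
summary = (mean(errors), pstdev(errors))   # reported as "mean ± std"
```

A tracker that loses the patch accumulates large per-frame distances, which is reflected in both a large mean and a large standard deviation.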

6.2 Tracking Real Dynamic Textures 

To test our algorithm on real videos with dynamic tex- 
tures, we provide results on three different scenarios 
in real videos. We learn the system parameters of the 
dynamic texture by marking a bounding box around 
the texture in a video where the texture is stationary. 
We then use these parameters to track the dynamic 
texture in a different video with camera motion caus- 
ing the texture to move around in the scene. All the 
trackers are initialized at the same location in the test 
video. Fig. 12 provides the results for the three examples by alternately showing the training video followed by the tracker results in the test video. We will use (row, column) notation to refer to specific images in Fig. 12.

Candle Flame. This is a particularly challenging scene as there are multiple candles with similar dynamics and appearance, and it is easy for a tracker to switch between candles. As we can see from (2,2), MS-HR and MS-VR seem to jump around between candles. Our method also jumps to another candle in (2,3) but recovers in (2,4), whereas MS is unable to recover. DT-PF quickly loses track and diverges. Overall, all trackers except DT-PF seem to perform equally well.

Fig. 10 Tracking water with different appearance dynamics on water waves. [Boost (magenta), TM (yellow), MS (black), MS-VR (blue) and MS-HR (red), DT-PF (light brown), DK-SSD-T (cyan). Ground-truth (green)]

Flags. Even though the flag has a distinct appearance 
compared to the background, the movement of the flag 
fluttering in the air changes the appearance in a dynamic 
fashion. Since our tracker has learnt an explicit model 
of the dynamics of these appearance changes, it stays 
closest to the correct location while testing. Boost, DT- 
PF, and the other trackers deviate from the true location 
as can be seen in (4,3) and (4,4). 



Fire. Here we show another practical application of 
our proposed method: tracking fire in videos. We learn a 
dynamical model for fire from a training video which is 
taken in a completely different domain, e.g., a campfire. 
We then use these learnt parameters to track fire in 
a NASCAR video as shown in the last row of Fig. 12. 
Our foreground-only dynamic tracker performs better 
than MS and TM, the other foreground-only methods 
in comparison. Boost, MS-VR and MS-HR use back-
ground information, which is discriminative enough in
this particular video, and achieve similar results. DT-PF
diverges from the true location and performs the worst. 
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(a) Training video with labeled stationary patch for learning candle flame dynamic texture system parameters. 




(b) Test video with tracked locations of candle flame. 




(c) Training video with labeled stationary patch for learning flag dynamic texture system parameters. 




(d) Test video with tracked locations of flag. 






(e) Training video with labeled stationary patch for learning fire dynamic texture system parameters. 




(f) Test video with tracked locations of fire. 

Fig. 12 Training and Testing results for dynamic texture tracking. Boost (magenta), TM (yellow), MS (black), MS-VR (blue) 
and MS-HR (red), DT-PF (light brown), DK-SSD-T (cyan). Training (green). 
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Fig. 13 Using the dynamics of optical flow to track a human action. Parts-based human detector (red broken), Boost (magenta), 
TM (yellow), MS (black), MS-VR (blue) and MS-HR (red), DK-SSD-T (cyan). First row: Tracking a walking person, without 
any pre-processing for state-of-the-art methods. Second row: Tracking a walking person, with pre-processing for state-of-the-art 
methods. Third row: Tracking a running person, with pre-processing for state-of-the-art methods. 



7 Experiments on Tracking Human Actions 

To demonstrate that our framework is general and can
be applied to track dynamic visual phenomena in any
domain, we consider the problem of tracking humans
while performing specific actions. This is different from 
the general problem of tracking humans, as we want to 
track humans performing specific actions such as walking 
or running. It has been shown, e.g., in Efros et al. (2003); 
Chaudhry et al. (2009); Lin et al. (2009) that the optical 
flow generated by the motion of a person in the scene 
is characteristic of the action being performed by the 
human. In general, global features extracted from optical
flow perform better than intensity-based global features 
for action recognition tasks. The variation in the optical 
flow signature as a person performs an action displays 
very characteristic dynamics. We therefore model the 
variation in optical flow generated by the motion of a 
person in a scene as the output of a linear dynamical 
system and pose the action tracking problem in terms 
of matching the observed optical flow with a dynamic 
template of flow fields for that particular action. 

We collect a dataset of 55 videos, each containing a
single human performing one of two actions (walking
or running). The videos are taken with a stationary
camera; however, the background has a small amount of
dynamic content in the form of waving bushes. Simple
background subtraction would therefore lead to erro- 
neous bounding boxes. We manually extract bounding 
boxes and the corresponding centroids to mark the loca- 



tion of the person in each frame of the video. These are 
then used as ground-truth for later comparisons as well 
as to learn the dynamics of the optical flow generated 
by the person as they move in the scene. 

For each bounding box centered at $l_t$, we extract
the corresponding optical flow $\mathcal{F}(l_t) = [F_x(l_t), F_y(l_t)]$,
and model the optical flow time-series, $\{\mathcal{F}(l_t)\}_{t=1}^{T}$, as
a Linear Dynamical System. We extract the system
parameters, $(\mu, A, C, Q, R)$, for each optical flow time-series
using the system identification method in §3.1.
This gives us the system parameters and ground-truth
tracks for each of the 55 human action samples.
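
As an illustration of this identification step, the following is a minimal sketch of a PCA-based (SVD) identification, a common sub-optimal way of learning LDS parameters from a feature time-series. That it matches the method of §3.1 in every detail is an assumption. The function name `identify_lds`, the state dimension `n`, and the isotropic noise estimate `sigma2_R` (standing in for a full covariance R) are hypothetical choices made for this sketch.

```python
import numpy as np

def identify_lds(Y, n):
    """Estimate LDS parameters from a feature matrix Y of shape (p, T).

    Each column of Y is one vectorized optical-flow frame; n is the
    (assumed) dimension of the hidden state.
    """
    mu = Y.mean(axis=1, keepdims=True)            # temporal mean of the features
    U, s, Vt = np.linalg.svd(Y - mu, full_matrices=False)
    C = U[:, :n]                                  # observation matrix, orthonormal columns
    X = np.diag(s[:n]) @ Vt[:n, :]                # estimated state sequence, shape (n, T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])      # least-squares state transition
    V = X[:, 1:] - A @ X[:, :-1]                  # state prediction residuals
    Q = V @ V.T / V.shape[1]                      # process noise covariance
    W = (Y - mu) - C @ X                          # observation residuals
    sigma2_R = float(W.var())                     # isotropic stand-in for R (assumption)
    return mu.ravel(), A, C, Q, sigma2_R, X
```

Keeping C orthonormal makes the state estimate for a given frame a simple projection, which is convenient when many training models have to be evaluated on the same test video.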

Given a test video, computing the tracks and internal 
state of the optical-flow dynamical system at each time 
instant amounts to minimizing the function, 



$$O(l_t, x_t) = \frac{1}{2\sigma_R^2}\,\big\|\mathcal{F}(l_t) - (\mu + C x_t)\big\|^2 + \frac{1}{2}(x_t - A x_{t-1})^\top Q^{-1}(x_t - A x_{t-1}). \tag{31}$$

To find the optimal $l_t$ and $x_t$, Eq. (31) is optimized in
the same gradient descent fashion as Eq. (24).
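
For a fixed candidate location, an objective of this form (a scaled squared reconstruction error plus a Mahalanobis penalty on the state innovation) is quadratic in the state, so the state update can be checked in closed form. The sketch below assumes isotropic observation noise with variance `sigma2_R`; the names `objective` and `best_state` are hypothetical, and the sketch covers only the state for a given location, not the joint gradient descent over location and state used in the paper.

```python
import numpy as np

def objective(f, x, x_prev, mu, A, C, Q_inv, sigma2_R):
    """Reconstruction error of the flow f plus the state-innovation penalty."""
    r = f - (mu + C @ x)            # observed flow minus the dynamic template
    e = x - A @ x_prev              # deviation from the predicted state
    return r @ r / (2.0 * sigma2_R) + 0.5 * e @ Q_inv @ e

def best_state(f, x_prev, mu, A, C, Q_inv, sigma2_R):
    """State minimizing the objective for a fixed location (normal equations)."""
    H = C.T @ C / sigma2_R + Q_inv
    b = C.T @ (f - mu) / sigma2_R + Q_inv @ (A @ x_prev)
    return np.linalg.solve(H, b)
```

Evaluating `objective` at the state returned by `best_state` for each candidate location gives a simple way to sanity-check a gradient descent implementation.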

We use the learnt system parameters in a leave- 
one-out fashion to track the activity in each test video 
sequence. Taking one sequence as a test sequence, we 
use all the remaining sequences as training sequences. 
The flow-dynamics system parameters extracted from 
each training sequence are used to track the action in 
the test sequence. Therefore, for each test sequence 
$j \in \{1, \dots, N\}$, we get $N - 1$ tracked locations and
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Fig. 14 Comparison of various tracking methods without (top row) and with (bottom row) background subtraction. Figures 
on right are zoomed-in versions of figures on left. 



state estimate time-series by using all the remaining 
$N - 1$ extracted system parameters. As described in §4,
we choose the tracks that give the minimum objective 
function value in Eq. (29). 
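
The selection step can be sketched as follows. It is assumed here that the per-frame objective values of each candidate are summed over the sequence before comparison; the tuple layout and the name `select_track` are hypothetical.

```python
import numpy as np

def select_track(candidates):
    """Pick the candidate with the smallest accumulated objective value.

    candidates: list of (model_id, track, per_frame_objective) tuples, one
    per training model used to track the test sequence.
    """
    totals = [float(np.sum(obj)) for _, _, obj in candidates]
    best = int(np.argmin(totals))
    model_id, track, _ = candidates[best]
    return model_id, track, totals[best]
```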

Fig. 13 shows the tracking results against the state- 
of-the-art algorithms. Since this is a human activity 
tracking problem, we also compare our method against 
the parts-based human detector of Felzenszwalb et al. 
(2010) when trained on the PASCAL human dataset. 
We used the publicly available^ code for this comparison 
with default parameters and thresholds^. The detection 
with the highest probability is used as the location of 
the human in each frame. 

The first row in Fig. 13 shows the results for all 
the trackers when applied to the test video. The state- 
of-the-art trackers do not perform very well. In fact 

^ http://people.cs.uchicago.edu/~pff/latent/

^ It might be possible to achieve better detections on this 
dataset by tweaking parameters/thresholds. However we did 
not attempt this as it is not the focus of our paper. 



the foreground-only trackers, MS, TM and DT-PF, lose
track altogether. Our proposed method (cyan) gives
the best tracking results and the best bounding box
covering the person across all frames. The parts-based
detector at times gives either no response or spurious
detections, whereas Boost, MS-VR and MS-
HR do not properly align with the human. Since we use
optical flow as a feature, it might seem that there is an
implicit form of background subtraction in our method.
For a fairer comparison, we performed background
subtraction as a pre-processing step on all the test videos
before using the state-of-the-art trackers. The second
row in Fig. 13 shows the tracking results for the same 
walking video as in row 1. We see that the tracking 
has improved but our tracker still gives the best results. 
The third row in Fig. 13 shows tracking results for a 
running person. Here again, our tracker performs the 
best whereas Boost and DT-PF perform the worst. 

To derive quantitative conclusions about tracker per-
formance, Fig. 14 shows the median tracker location
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(b) DK-SSD-TR-C 



Fig. 15 Simultaneous Tracking and Recognition results showing median tracking error and classification results, when using 
(a) DK-SSD-TR-R: the objective function in Eq. (29), and (b) DK-SSD-TR-C: the Martin distance between dynamical systems 
in Eq. (30) with a 1-NN classifier, (walking (blue), running (red)). 



error for each video sorted in ascending order. The best 
method should have the smallest error for most of the 
videos. As we can see, both without (1st row) and with 
background subtraction (2nd row), our method provides 
the smallest median location error against all state-of-
the-art methods for all except 5 sequences. Moreover, when
used as black boxes without background subtraction as a
pre-processing step, all state-of-the-art methods perform
extremely poorly.
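
The per-video statistic plotted in Fig. 14 can be computed as in the following sketch, which assumes that the location error of a frame is the Euclidean distance between the tracked and ground-truth centroids. The function names are hypothetical.

```python
import numpy as np

def median_location_error(track, ground_truth):
    """Median Euclidean distance (in pixels) between two centroid tracks."""
    track = np.asarray(track, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    return float(np.median(np.linalg.norm(track - gt, axis=1)))

def sorted_errors(tracks, ground_truths):
    """Median error of every video, sorted in ascending order for plotting."""
    errs = [median_location_error(t, g) for t, g in zip(tracks, ground_truths)]
    return np.sort(errs)
```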



8 Experiments on Simultaneous Action 
Tracking and Recognition 

In the previous section, we have shown that given train- 
ing examples for actions, our tracking framework, DK- 
SSD-T, can be used to perform human action tracking 
using system parameters learnt from training data with 
correct tracks. We used the value of the objective func- 
tion in Eq. (31) to select the tracking result. In this 
section, we will extensively test our simultaneous track- 
ing and classification approach presented in §4 on the 
two-action database introduced in the previous section 
as well as the commonly used Weizmann human action 
dataset (Gorelick et al., 2007). We will also show that we 
can learn the system parameters for a class of dynamic 
templates from one database and use it to simultane- 
ously track and recognize the template in novel videos 
from other databases. 



8.1 Walking/Running Database

Fig. 15 shows the median pixel tracking error of each test 
sequence using the leave-one-out validation described in 
the previous section for selecting the tracks, sorted by 
true action class. The first 28 sequences belong to the 
class walking, while the remaining 27 sequences belong to
the running class. Sequences identified by our proposed 
approach as walking are colored blue, whereas sequences 
identified as running are colored red. Fig. 15(a) shows 
the tracking error and classification result when using 
DK-SSD-TR-R, i.e., the objective function value in Eq. 
(29) with a 1-NN classifier to simultaneously track and 
recognize the action. The tracking results shown also 
correspond to the ones shown in the previous section. As 
we can see, for almost all sequences, the tracking error is 
within 5 pixels from the ground truth tracks. Moreover, 
for all but 4 sequences, the action is classified correctly, 
leading to an overall action classification rate of 93%. 
Fig. 15(b) shows the tracking error and class labels when 
using DK-SSD-TR-C, i.e., the Martin distance based
classifier term proposed in Eq. (30) to simultaneously 
track and recognize the action. As we can see, DK-SSD- 
TR-C results in even better tracking and classification 
results. Only two sequences are mis-classified for an 
overall recognition rate of 96%. 

To show that our simultaneous tracking and clas- 
sification framework computes the correct state of the 
dynamical system for the test video given that the class 
of the training system is the same, we illustrate several 
components of the action tracking framework for two 
cases: a right to left walking person tracked using dy- 
(a) Intensity and optical flow frame with ground-truth tracks (green) and tracked outputs of our algorithm (blue) on test video with walking person tracked using system parameters learnt from the walking person video in (b).

(b) Intensity and optical flow frame with ground-truth tracks (green) used to train system parameters for a walking model.



(c) Optical flow bounding boxes at ground truth locations. 






(d) Optical flow bounding boxes at tracked locations. 




(e) Optical flow as generated by the computed states at corresponding time instants at the tracked locations. 

(f) Optical flow bounding boxes from the training video used to learn system parameters. 

Fig. 16 Tracking a walking person using dynamical system parameters learnt from another walking person with opposite 
walking direction. The color of the optical flow diagrams represents the direction (e.g., right to left is cyan, left to right is red)
and the intensity represents the magnitude of the optical flow vector. 



namical system parameters learnt from another walking 
person moving left to right, in Fig. 16, and the same per- 
son tracked using dynamical system parameters learnt 
from a running person in Fig. 17. 

Fig. 16(a) shows a frame with its corresponding op- 
tical flow, ground truth tracks (green) as well as the 
tracks computed using our algorithm (blue). As we can
see, the computed tracks accurately line up with the
ground-truth tracks. Fig. 16(b) shows a frame and corre-
sponding optical flow along with the ground-truth tracks
used to extract optical flow bounding boxes to learn 
the dynamical system parameters. Fig. 16(c) shows the 
optical flow extracted from the bounding boxes at the 
ground-truth locations in the test video at intervals of 5 
frames and Fig. 16(d) shows the optical flow extracted 



from the bounding boxes at the tracked locations at the 
same frame numbers. As the extracted tracks are very 
accurate, the flow-bounding boxes line up very accu- 
rately. Since our dynamical system model is generative, 
at each time-instant, we can use the computed state, 
$x_t$, to generate the corresponding output $y_t = \mu + C x_t$.
Fig. 16(e) displays the optical flow computed in this 
manner at the corresponding frames in Fig. 16(d). We 
can see that the generated flow appears like a smoothed 
version of the observed flow at the correct location. This 
shows that the internal state of the system was correctly 
computed according to the training system parameters, 
which leads to accurate dynamic template generation 
and tracking. Fig. 16(f) shows the optical flow at the 
bounding boxes extracted from the ground-truth loca- 
(a) Intensity and optical flow frame with ground-truth tracks (green) and tracked outputs of our algorithm (red) on test video with walking person tracked using system parameters learnt from the running person video in (b).

(b) Intensity and optical flow frame with ground-truth tracks (green) used to train system parameters for a running model.






(c) Optical flow bounding boxes at ground truth locations. 
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(d) Optical flow bounding boxes at tracked locations. 
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(e) Optical flow as generated by the computed states at corresponding time instants at the tracked locations. 


(f) Optical flow bounding boxes from the training video used to learn system parameters. 

Fig. 17 Tracking a walking person using dynamical system parameters learnt from a running person. The color of the optical 
flow diagrams represents the direction (e.g., right to left is cyan, left to right is red) and the intensity represents the magnitude
of the optical flow vector. 



tions in the training video. The direction of motion is 
opposite to that in the test video; however, using the
mean optical flow direction, the system parameters can
be appropriately transformed at test time to account
for this change in direction, as discussed in §3.4.
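
Because the model is generative, the flow template for a computed state is obtained by a single matrix-vector product. The sketch below assumes that the feature vector stacks the horizontal flow followed by the vertical flow, each vectorized column-wise; this layout and the name `generate_flow` are assumptions made for illustration.

```python
import numpy as np

def generate_flow(mu, C, x, height, width):
    """Generate the flow template y = mu + C x and reshape it to two images."""
    y = mu + C @ x
    n_pix = height * width
    fx = y[:n_pix].reshape(height, width, order="F")   # horizontal component
    fy = y[n_pix:].reshape(height, width, order="F")   # vertical component
    return fx, fy
```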

Fig. 17 repeats the above experiment when track-
ing the same walking person with a running model.
As we can see in Fig. 17(a), the computed tracks are
not very accurate when using the wrong action class
for tracking. This is also evident in the extracted flow
bounding boxes at the tracked locations, as the head
of the person is missing from almost all of the boxes.
Fig. 17(f) shows optical flow from several bounding boxes in
the training video of the running person. The states
computed using the dynamical system parameters learnt
from these bounding boxes lead to the generated



flows in Fig. 17(e) at the frames corresponding to those 
in Fig. 17(d). The generated flow does not match the 
ground-truth flow and leads to a high objective function 
value. 



8.2 Comparison to Tracking then Recognizing 

We will now compare our simultaneous tracking and 
recognition approaches, DK-SSD-TR-R and DK-SSD- 
TR-C to the tracking then recognizing approach where 
we first track the action using standard tracking algo- 
rithms described in §6 as well as our proposed dynamic 
tracker DK-SSD-T, and then use the Martin distance 
for dynamical systems with a 1-NN classifier to classify 
the tracked action. 
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Given a test video, we compute the location of the 
person using all the trackers described in §6 and then 
extract the bounding box around the tracked locations. 
For bounding boxes that extend beyond the image area, we
zero-pad the optical flow. This gives us an optical-flow 
time-series corresponding to the extracted tracks. We 
then use the approach described in §3.1 to learn the 
dynamical system parameters of this optical flow time- 
series. To classify the tracked action, we compute the 
Martin distance of the tracked system to all the training 
systems in the database and use 1-NN to classify the 
action. We then average the results over all sequences 
in a leave-one-out fashion. 
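
The classification step can be sketched as follows. The Martin distance is defined through the subspace angles $\theta_i$ between the infinite observability subspaces of two systems, $d_M^2 = -\ln \prod_i \cos^2\theta_i$. The sketch approximates it with observability matrices truncated to `m` block rows, which is an assumption made for brevity and not necessarily the computation used in the paper. The function names are hypothetical.

```python
import numpy as np

def extended_observability(A, C, m):
    """Stack C, CA, ..., CA^(m-1) into one tall matrix."""
    blocks, M = [], C.copy()
    for _ in range(m):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)

def martin_distance_sq(A1, C1, A2, C2, m=20):
    """Finite-horizon approximation of the squared Martin distance."""
    Q1, _ = np.linalg.qr(extended_observability(A1, C1, m))
    Q2, _ = np.linalg.qr(extended_observability(A2, C2, m))
    # Singular values of Q1^T Q2 are the cosines of the subspace angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cosines = np.clip(cosines, 1e-12, 1.0)
    return float(-2.0 * np.sum(np.log(cosines)))

def classify_1nn(test_model, train_models, train_labels, m=20):
    """Label of the training system closest to the test system."""
    d = [martin_distance_sq(*test_model, A, C, m) for A, C in train_models]
    return train_labels[int(np.argmin(d))]
```

The distance depends only on the column space of the observability matrix, so two parameterizations of the same system that differ by a change of state basis are at distance zero.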

Table 2 shows the recognition results for the 2-class 
Walking/Running database by using this classification
scheme after performing tracking as well as our proposed 
simultaneous tracking and recognition algorithms. We 
have provided results for both the original sequences 
as well as when background subtraction was performed 
prior to tracking. We showed in Fig. 14(a) and Fig. 
14(c) that all the standard trackers performed poorly 
without using background subtraction. Our tracking 
then recognizing method, DK-SSD-T+R, provides ac-
curate tracks and therefore gives a recognition rate
of 96.36%. The parts-based human detector of Felzen-
szwalb et al. (2010) fails to detect the person
in some frames and therefore its recognition rate is
the worst. When using background subtraction to pre-
process the videos, the best recognition rate is provided
by TM at 98.18% whereas MS-HR performs at the same
rate as our proposed method. The recognition rates of
the other methods, except the human detector, also
increase due to this pre-processing step. For comparison,
if we use the ground-truth tracks to learn the dynami- 
cal system parameters and classify the action using the 
Martin distance and 1-NN classification, we get 100% 
recognition. 

Our joint-optimization scheme using the objective
function value as the classifier, DK-SSD-TR-R, performs
slightly worse, at 92.73%, than the best tracking then
recognizing approach. However, when we use the Martin-
distance based classification cost in DK-SSD-TR-C, we
get the best action classification performance, 96.36%.
Needless to say, the tracking then recognizing scheme
is more computationally intensive and is a 2-step pro- 
cedure. Moreover, our joint optimization scheme based 
only on foreground dynamics performs better than track- 
ing then recognizing using all other trackers without pre- 
processing. At this point, we would also like to note that 
even though the best recognition percentage achieved by 
the tracking-then-recognize method was 98.18% using 
TM, it performed worse in tracking as shown in Fig. 
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Table 2 Recognition rates for the 2-class Walking/ Running 
database using 1-NN with Martin distance for dynamical 
systems after computing tracks from different algorithms 



Action   Median ± RSE     Action   Median ± RSE
Bend     5.7 ± 1.9        Side     2.1 ± 0.9
Jack     4.2 ± 0.6        Skip     2.8 ± 1.1
Jump     1.5 ± 0.2        Walk     3.3 ± 2.4
PJump    3.7 ± 1.3        Wave1    14.8 ± 7.4
Run      4.2 ± 2.9        Wave2    2.3 ± 0.7



Table 3 Median and robust standard error (RSE) (see text) 
of the tracking error for the Weizmann Action database, using 
the proposed simultaneous tracking and recognition approach 
with the classification cost. We get an overall action recognition 
rate of 92.47%. 



14(d). Overall, our method simultaneously gives the best
tracking and recognition performance. 

8.3 Weizmann Action Database 

We also tested our simultaneous tracking and recognition
framework on the Weizmann Action database (Gorelick
et al., 2007). This database consists of 10 actions with
a total of 93 sequences and contains both stationary
actions such as jumping in place, bending, etc., and non-
stationary actions such as running, walking, etc. We
used the provided backgrounds to extract bounding 
boxes and ground-truth tracks from all the sequences 
and learnt the parameters of the optical-flow dynamical 
systems using the same approach as outlined earlier. A 
commonly used evaluation scheme for the Weizmann 
database is leave-one-out classification. Therefore, we 
also used our proposed framework to track the action 
in a test video given the system parameters of all the 
remaining actions. 

Table 3 shows the median and robust standard er-
ror (RSE), i.e., the square-root of the median of
$\{(x - \mathrm{median}(x))^2\}$, of the tracking error for each class. Other
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Fig. 18 Simultaneous Tracking and Recognition results show- 
ing median tracking error and classification results, when using 
the objective function, and when using the Martin distance 
between dynamical systems with a 1-NN classifier, (walking 
(blue), running (red)). 
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Fig. 19 Confusion matrix for leave-one-out classification on 
the Weizmann database using our simultaneous tracking and 
recognition approach. Overall recognition rate of 92.47%. 

than Wave1, all classes have a median tracking error
under 6 pixels as well as very small deviations from 
the median error. Furthermore, we get a simultane- 
ous recognition rate of 92.47% which corresponds to 
only 7 mis-classified sequences. Fig. 19 shows the corre- 
sponding confusion matrix. Fig. 18 shows the median 
tracking error per frame for each of the 93 sequences 
in the Weizmann database. The color of the stem-plot 
indicates whether the sequence was classified correctly 
(blue) or incorrectly (red). Table 4 shows the recogni-
tion rate of some state-of-the-art methods on the Weiz-
mann database. However, notice that all these methods
are geared towards recognition and either assume that 
tracking has been accurately done before the recogni- 



Method                               Recognition (%)
Xie et al. (2011)                    95.60
Thurau and Hlavac (2008)             94.40
Ikizler and Duygulu (2009)           100.00
Gorelick et al. (2007)               99.60
Niebles et al. (2008)                90.00
Ali and Shah (2010)                  95.75
Ground-truth tracks (1-NN Martin)    96.77
Our method, DK-SSD-TR-C              92.47



Table 4 Comparison of different approaches for action recog- 
nition on the Weizmann database against our simultaneous 
tracking and recognition approach. 



tion is performed, or use spatio-temporal features for 
recognition that cannot be used for accurate tracking.
Furthermore, if the ground-truth tracks were provided,
using the Martin distance between dynamical systems
with a 1-NN classifier gives a recognition rate of 96.77%.
Our simultaneous tracking and recognition approach is
very close to this performance. The method by Xie et al.
(2011) seems to be the only attempt to simultaneously
locate and recognize human actions in videos. However,
their method does not perform tracking; instead, it ex-
tends the parts-based detector by Felzenszwalb et al.
(2010) to explicitly consider temporal variations caused 
by various actions. Moreover, they do not have any track- 
ing results in their paper other than a few qualitative 
detection results. Our approach is the first to explicitly 
enable simultaneous tracking and recognition of dynamic 
templates that is generalizable to any dynamic visual 
phenomenon and not just human actions. 
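
For reference, the robust standard error reported in Table 3, i.e., the square root of the median of the squared deviations from the median, can be computed as in this short sketch. The function name `median_and_rse` is hypothetical.

```python
import numpy as np

def median_and_rse(errors):
    """Median and robust standard error of a set of tracking errors."""
    e = np.asarray(errors, dtype=float)
    med = np.median(e)
    rse = np.sqrt(np.median((e - med) ** 2))
    return float(med), float(rse)
```

Unlike the ordinary standard deviation, this statistic is barely affected by a few frames with very large error, so a tracker that briefly drifts is not penalized as heavily as one that is consistently off.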

We will now demonstrate that our simultaneous 
tracking and recognition framework is fairly general and 
we can train for a dynamic template on one database 
and use the models to test on a totally different database.
We used our trained walking and running action models 
(i.e., the corresponding system parameters) from the 2- 
class walking/running database in §8.1 and applied our 
proposed algorithm for joint tracking and recognizing 
the running and walking videos in the Weizmann human 
action database (Gorelick et al., 2007), without any a- 
priori training or adapting to the Weizmann database. 
Of the 93 videos in the Weizmann database, there are a 
total of 10 walking and 10 running sequences. Fig. 20 
shows the median tracking error for each sequence and 
the color codes its action, running (red) and walking 
(blue). As we can see, all the sequences have a median
error under 5 pixels and all 20 sequences are
classified correctly. This demonstrates the fact that our 
scheme is general and that the requirement of learning 
the system parameters of the dynamic template is not 
necessarily a bottleneck as the parameters need not be 
learnt again for every scenario. We can use the learnt 
[Plot for Fig. 20. Horizontal axis: video number (1-10: Run, 11-20: Walk).]

Fig. 20 Tracking walking and running sequences in the Weiz- 
mann database using trained system parameters from the 
database introduced in §8.1. The ground-truth label of the 
first 10 sequences is running, while the rest are walking. The 
results of the joint tracking and recognition scheme are labeled
as running (red) and walking (blue). We get 100% recognition 
and under 5 pixel median location error. 



system parameters from one dataset to perform tracking 
and recognition in a different dataset. 



9 Conclusions, Limitations and Future Work 

In this paper, we have proposed a novel framework for 
tracking dynamic templates such as dynamic textures 
and human actions that are modeled by Linear Dy- 
namical Systems. We posed the tracking problem as a 
maximum a-posteriori estimation problem for the cur- 
rent location and the LDS state, given the current image 
features and the previous state. By explicitly consider- 
ing the dynamics of only the foreground, we are able to 
get state-of-the-art tracking results on both synthetic 
and real video sequences against methods that use ad-
ditional information about the background. Moreover,
we have shown that our approach is general and can be
applied to any dynamic feature such as optical flow. Our
method performs on par with state-of-the-art methods
when tracking human actions. We have shown excel-
lent results for simultaneous tracking and recognition
of human actions and demonstrated that our method
performs better than simply tracking then recognizing
human actions when no pre-processing is performed
on the test sequences. In addition, our approach is com-
putationally more efficient as it provides both tracks
and template recognition at the same cost. Finally, we
showed that the requirement of having a training set of 
system parameters for the dynamic templates is not re- 



strictive as we can train on one dataset and then use the 
learnt parameters at test time on any sequence where 
the desired action needs to be found. 

Although our simultaneous tracking and recognition 
approach has shown promising results, there are cer- 
tain limitations. Firstly, as mentioned earlier, since our 
method uses gradient descent, it is prone to converging
to non-optimal local minima. Having a highly non-linear
feature function or non-linear dynamics could poten- 
tially result in sub-optimal state estimation which could 
lead to high objective function values even when the 
correct class is chosen for tracking. Therefore, if possible, 
it is better to choose linear dynamics and model system 
parameter changes under different transformations in- 
stead of modeling dynamics of transformation-invariant 
but highly non-linear features. However it must be noted 
that for robustness, some non-linearities in features such 
as kernel-weighted histograms are necessary and need 
to be modeled. Secondly, since our approach requires
performing tracking and recognition using all the train-
ing data, it could become computationally expensive as the
number of training sequences and the number of classes
increase. This can be alleviated by using a smart classi-
fier or more generic one-model-per-class based methods.
We leave this as future work. 

In other future work, we are looking at online learn- 
ing of the system parameters of the dynamic template, 
so that for any new dynamic texture in the scene, the 
system parameters are also learnt simultaneously as the 
tracking proceeds. This way, our approach will be appli- 
cable to dynamic templates for which system parameters 
are not readily available. 
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A Derivation of the Gradient Descent Scheme 

In this appendix, we show the detailed derivation of the
iterations in Eq. (26) to minimize the objective function in Eq.
(24), with $\rho$, the non-differentiable kernel-weighted histogram,
replaced by our proposed differentiable kernel-weighted
histogram $\zeta$:

$$O(l_t, x_t) = \frac{1}{2\sigma_H^2}\,\big\|\sqrt{\zeta(y_t(l_t))} - \sqrt{\zeta(\mu + C x_t)}\big\|^2 + \frac{1}{2}(x_t - A x_{t-1})^\top Q^{-1}(x_t - A x_{t-1}). \tag{32}$$

Using the change of variable $z' = z + l_t$, the proposed kernel-weighted
histogram for bin $u$, Eq. (25),

$$\zeta_u(y_t(l_t)) = \frac{1}{\kappa}\sum_{z \in \Omega} K(z)\,\big(\phi_{u-1}(y_t(z + l_t)) - \phi_u(y_t(z + l_t))\big),$$

can be written as

$$\zeta_u(y_t(l_t)) = \frac{1}{\kappa}\sum_{z' \in \{\Omega + l_t\}} K(z' - l_t)\,\big(\phi_{u-1}(y_t(z')) - \phi_u(y_t(z'))\big). \tag{33}$$



Following the formulation in Hager et al. (2004), we define
the sifting vector,

$$u_j = \begin{bmatrix} \phi_{j-1}(y_t(z'_1)) - \phi_j(y_t(z'_1)) \\ \vdots \\ \phi_{j-1}(y_t(z'_N)) - \phi_j(y_t(z'_N)) \end{bmatrix}. \tag{34}$$

We can then combine these sifting vectors into the sifting
matrix, $U = [u_1, \ldots, u_B]$. Similarly, we can define the kernel
vector, $K(z) = \frac{1}{\kappa}[K(z_1), K(z_2), \ldots, K(z_N)]^\top$, where $N = |\Omega|$
and the indexing of the pixels is performed column-wise.
Therefore we can write the full kernel-weighted histogram, $\zeta$,
as

$$\zeta(y_t(l_t)) = U^\top K(z' - l_t). \tag{35}$$

Since $U$ is not a function of $l_t$,

$$\nabla_{l_t}\big(\zeta(y_t(l_t))\big) = U^\top J_K, \tag{36}$$

where

$$J_K = \left[\frac{\partial K(z' - l_t)}{\partial l_{t;1}},\ \frac{\partial K(z' - l_t)}{\partial l_{t;2}}\right] = \big[\nabla K(z'_1 - l_t),\ \nabla K(z'_2 - l_t),\ \ldots,\ \nabla K(z'_N - l_t)\big]^\top, \tag{37}$$

and $\nabla K(z' - l_t)$ is the derivative of the kernel function,
e.g., the Epanechnikov kernel in Eq. (10). Therefore the deriva-
tive of the first kernel-weighted histogram, $\sqrt{\zeta(y_t(l_t))}$, w.r.t. $l_t$
is

$$L = \frac{1}{2}\,\mathrm{diag}\big(\zeta(y_t(l_t))\big)^{-\frac{1}{2}}\, U^\top J_K, \tag{38}$$

where $\mathrm{diag}(v)$ creates a diagonal matrix with $v$ on its diagonal
and 0 in its off-diagonal entries. Since $y_t(l_t)$ does not depend
on the state of the dynamic template, $x_t$, the derivative of
the first kernel-weighted histogram w.r.t. $x_t$ is 0.
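
To make the sifting-matrix form of the histogram concrete, the sketch below builds a differentiable kernel-weighted histogram of an image patch as the product of the transposed sifting matrix and a normalized kernel vector. The logistic form of the bin functions, the `steepness` parameter, the bin `edges`, and the particular scaling of the Epanechnikov-style profile are all assumptions made for illustration; Eqs. (10) and (25) of the paper may differ in detail.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def epanechnikov(h, w):
    """Epanechnikov-style profile 1 - r^2, clipped at zero, on an h x w grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r2 = ((ys - cy) / (h / 2.0)) ** 2 + ((xs - cx) / (w / 2.0)) ** 2
    return np.maximum(1.0 - r2, 0.0)

def soft_histogram(patch, edges, steepness=50.0):
    """Differentiable kernel-weighted histogram with len(edges) - 1 bins."""
    y = patch.ravel(order="F")                       # column-wise pixel indexing
    k = epanechnikov(*patch.shape).ravel(order="F")
    k = k / k.sum()                                  # normalized kernel vector
    phi = sigmoid(steepness * (y[:, None] - edges[None, :]))
    U = phi[:, :-1] - phi[:, 1:]                     # sifting matrix, one column per bin
    return U.T @ k
```

Each column of `U` is close to an indicator of the pixels falling in one bin, and becomes an exact indicator as `steepness` grows, which recovers the usual non-differentiable histogram in the limit.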

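In the same manner, the expression for the second kernel-weighted
histogram, for bin $u$, is

$$\zeta_u(\mu + C x_t) = \frac{1}{\kappa}\sum_{z \in \Omega} K(z)\,\big(\phi_{u-1}(\mu(z) + C(z)^\top x_t) - \phi_u(\mu(z) + C(z)^\top x_t)\big). \tag{39}$$

By using a similar sifting vector for the predicted dynamic
template,

$$\Phi_j = \begin{bmatrix} (\phi_{j-1} - \phi_j)(\mu(z_1) + C(z_1)^\top x_t) \\ (\phi_{j-1} - \phi_j)(\mu(z_2) + C(z_2)^\top x_t) \\ \vdots \\ (\phi_{j-1} - \phi_j)(\mu(z_N) + C(z_N)^\top x_t) \end{bmatrix}, \tag{40}$$

where for brevity, we use

$$(\phi_{j-1} - \phi_j)(\mu(z) + C(z)^\top x_t) = \phi_{j-1}(\mu(z) + C(z)^\top x_t) - \phi_j(\mu(z) + C(z)^\top x_t),$$

and the pixel indices $z_i$ are used in a column-wise fashion as
discussed in the main text. Using the corresponding sifting
matrix, $\Phi = [\Phi_1, \Phi_2, \ldots, \Phi_B] \in \mathbb{R}^{N \times B}$, we can write

$$\zeta(\mu + C x_t) = \Phi^\top K(z). \tag{41}$$

Since $\zeta(\mu + C x_t)$ is only a function of $x_t$,

$$\nabla_{x_t}\big(\zeta(\mu + C x_t)\big) = (\Phi')^\top \mathrm{diag}(K(z))\, C, \tag{42}$$

where $\Phi' = [\Phi'_1, \Phi'_2, \ldots, \Phi'_B]$ is the derivative of the sifting
matrix with

$$\Phi'_j = \begin{bmatrix} (\phi'_{j-1} - \phi'_j)(\mu(z_1) + C(z_1)^\top x_t) \\ (\phi'_{j-1} - \phi'_j)(\mu(z_2) + C(z_2)^\top x_t) \\ \vdots \\ (\phi'_{j-1} - \phi'_j)(\mu(z_N) + C(z_N)^\top x_t) \end{bmatrix}, \tag{43}$$

and $\phi'$ is the derivative of the sigmoid function. Therefore, the
derivative of the second kernel-weighted histogram, $\sqrt{\zeta(\mu + C x_t)}$,
w.r.t. $x_t$ is

$$M = \frac{1}{2}\,\mathrm{diag}\big(\zeta(\mu + C x_t)\big)^{-\frac{1}{2}} (\Phi')^\top \mathrm{diag}(K(z))\, C. \tag{44}$$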



The second part of the cost function in Eq. (32) depends
only on the current state, and the derivative w.r.t. $x_t$ can be
simply computed as

$$d = Q^{-1}(x_t - A x_{t-1}). \tag{45}$$

When computing the derivatives of the squared difference
function in the first term in Eq. (32), the difference,

$$a_i = \sqrt{\zeta(y_t(l_t))} - \sqrt{\zeta(\mu + C x_t)}, \tag{46}$$

will be multiplied with the derivatives of the individual square
root kernel-weighted histograms.

Finally, the derivative of the cost function in Eq. (32)
w.r.t. $l_t$ is computed as

$$\nabla_{l_t} O(l_t, x_t) = \frac{1}{\sigma_H^2}\, L^\top a_i, \tag{47}$$

and the derivative of the cost function in Eq. (32) w.r.t. $x_t$ is
computed as

$$\nabla_{x_t} O(l_t, x_t) = \frac{1}{\sigma_H^2}\,(-M)^\top a_i + d. \tag{48}$$

We can incorporate the $\frac{1}{\sigma_H^2}$ term in $L$ and $M$ to get the
gradient descent optimization scheme in Eq. (26).
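
The state gradient can be sanity-checked numerically. The sketch below implements a cost of the form "scaled squared difference of square-root histograms plus state-innovation penalty" together with its analytic gradient w.r.t. the state, so that the two can be compared by finite differences. The logistic bin functions with steepness `s`, the scaling by `sigma2_H`, and all function names are assumptions made for this sketch; `Q_inv` is assumed symmetric.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def zeta_and_jacobian(v, k, edges, s):
    """Soft histogram of template values v (N,) and its Jacobian w.r.t. v (B, N)."""
    phi = sigmoid(s * (v[:, None] - edges[None, :]))   # (N, B + 1)
    dphi = s * phi * (1.0 - phi)                       # derivative of each sigmoid
    Phi = phi[:, :-1] - phi[:, 1:]                     # sifting matrix
    dPhi = dphi[:, :-1] - dphi[:, 1:]                  # its elementwise derivative
    return Phi.T @ k, dPhi.T * k                       # histogram, Phi'^T diag(K)

def cost_and_grad_x(x, x_prev, zeta_obs, mu, C, A, Q_inv, k, edges, s, sigma2_H):
    """Cost and analytic gradient w.r.t. the state x."""
    zeta_pred, J = zeta_and_jacobian(mu + C @ x, k, edges, s)
    a = np.sqrt(zeta_obs) - np.sqrt(zeta_pred)         # histogram difference
    M = 0.5 * (J @ C) / np.sqrt(zeta_pred)[:, None]    # d sqrt(zeta_pred) / d x
    e = x - A @ x_prev
    d = Q_inv @ e                                      # gradient of the dynamics term
    cost = a @ a / (2.0 * sigma2_H) + 0.5 * e @ Q_inv @ e
    grad = -(M.T @ a) / sigma2_H + d
    return cost, grad
```

Comparing `grad` with central differences of `cost` is a cheap way to catch sign or scaling errors before running the full tracker.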
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